Abstract

The success of many machine learning and pattern recognition methods relies heavily upon the 
identification of an appropriate distance metric on the input data. It is often beneficial to learn such a 
metric from the input training data, instead of using a default one such as the Euclidean distance. In 
this work, we propose a boosting-based technique, termed BoostMetric, for learning a quadratic 
Mahalanobis distance metric. Learning a valid Mahalanobis distance metric requires enforcing 
the constraint that the matrix parameter to the metric remains positive semidefinite. Semidefinite 
programming is often used to enforce this constraint, but does not scale well and is not easy to 
implement. BoostMetric is instead based on the observation that any positive semidefinite ma- 
trix can be decomposed into a linear combination of trace-one rank-one matrices. BoostMetric 
thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable 
boosting-based learning process. The resulting methods are easy to implement, efficient, and can 
accommodate various types of constraints. We extend traditional boosting algorithms in that the
weak learner is a positive semidefinite matrix with trace and rank equal to one, rather than a classifier
or regressor. Experiments on various datasets demonstrate that the proposed algorithms compare
favorably to state-of-the-art methods in terms of classification accuracy and running time.

Keywords: Mahalanobis distance, semidefinite programming, column generation, boosting, La- 
grange duality, large margin nearest neighbor 



1. Introduction 

The identification of an effective metric by which to measure distances between data points is an 
essential component of many machine learning algorithms including k-nearest neighbor (kNN), k-means
clustering, and kernel regression. These methods have been applied to a range of problems,
including image classification and retrieval (Hastie and Tibshirani, 1996; Yu et al., 2008; Jian and
Vemuri, 2007; Xing et al., 2002; Bar-Hillel et al., 2005; Boiman et al., 2008; Frome et al., 2007)
amongst a host of others. 

The Euclidean distance has been shown to be effective in a wide variety of circumstances. 
Boiman et al. (2008), for instance, showed that in generic object recognition with local features, 
kNN with a Euclidean metric can achieve comparable or better accuracy than more sophisticated
classifiers such as support vector machines (SVMs). The Mahalanobis distance represents a gen- 
eralization of the Euclidean distance, and offers the opportunity to learn a distance metric directly 
from the data. This learned Mahalanobis distance approach has been shown to offer improved per- 
formance over Euclidean distance-based approaches, and was particularly shown by Wang et al. 
(2010b) to represent an improvement upon the method of Boiman et al. (2008). It is the prospect 
of a significant performance improvement from fundamental machine learning algorithms which 
inspires the approach presented here. 

If we let $\mathbf{a}_i$, $i = 1, 2, \cdots$, represent a set of points in $\mathbb{R}^D$, then the Mahalanobis distance, or
Gaussian quadratic distance, between two points is

$$\mathrm{dist}_{ij} = \sqrt{(\mathbf{a}_i - \mathbf{a}_j)^\top \mathbf{X} (\mathbf{a}_i - \mathbf{a}_j)}, \qquad (1)$$

where $\mathbf{X} \succeq 0$ is a positive semidefinite (p.s.d.) matrix. The Mahalanobis distance is thus parameterized
by a p.s.d. matrix, and methods for learning Mahalanobis distances are therefore often
framed as constrained semidefinite programs. The approach we propose here, however, is based 
on boosting, which is more typically used for learning classifiers. The primary motivation for the 
boosting-based approach is that it scales well, but its efficiency in dealing with large data sets is also 
advantageous. The learning of Mahalanobis distance metrics represents a specific application of a 
more general method for matrix learning which we present below. 
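
As a minimal illustration of Equation (1), the following Python sketch evaluates the distance for a given p.s.d. matrix stored as nested lists. The function name `mahalanobis` and the example matrices are illustrative assumptions, not part of BoostMetric.

```python
import math

def mahalanobis(a_i, a_j, X):
    """Distance sqrt((a_i - a_j)^T X (a_i - a_j)) for a p.s.d. matrix X (list of rows)."""
    d = [p - q for p, q in zip(a_i, a_j)]
    # Quadratic form d^T X d, accumulated row by row.
    quad = sum(d[r] * sum(X[r][c] * d[c] for c in range(len(d))) for r in range(len(d)))
    return math.sqrt(max(quad, 0.0))  # clamp tiny negative round-off

identity = [[1.0, 0.0], [0.0, 1.0]]   # X = I recovers the Euclidean distance
weighted = [[4.0, 0.0], [0.0, 1.0]]   # hypothetical metric that stretches feature 0
a, b = [0.0, 0.0], [3.0, 4.0]
d_euclid = mahalanobis(a, b, identity)
d_weighted = mahalanobis(a, b, weighted)
```

With the identity matrix the function returns the ordinary Euclidean distance; with the diagonal example the first feature contributes four times as much to the squared distance.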

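We are interested here in the case where the training data consist of a set of constraints upon the
relative distances between data points,

$$\mathcal{I} = \{(\mathbf{a}_i, \mathbf{a}_j, \mathbf{a}_k) \mid \mathrm{dist}_{ij} < \mathrm{dist}_{ik}\}, \qquad (2)$$

where $\mathrm{dist}_{ij}$ measures the distance between $\mathbf{a}_i$ and $\mathbf{a}_j$. Each such constraint implies that "$\mathbf{a}_i$ is
closer to $\mathbf{a}_j$ than $\mathbf{a}_i$ is to $\mathbf{a}_k$". Constraints such as these often arise when it is known that $\mathbf{a}_i$ and $\mathbf{a}_j$
belong to the same class of data points while $\mathbf{a}_i$, $\mathbf{a}_k$ belong to different classes. These comparison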
constraints are thus often much easier to obtain than either the class labels or distances between data 
elements (Schultz and Joachims, 2003). For example, in video content retrieval, faces extracted from
successive frames at close locations can be safely assumed to belong to the same person, without 
requiring the individual to be identified. In web search, the results returned by a search engine 
are ranked according to the relevance, an ordering which allows a natural conversion into a set of 
constraints. 
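
When class labels are available, one simple way to obtain such a constraint set is to enumerate all index triplets in which the first two points share a label and the third does not. The sketch below assumes exhaustive enumeration, which grows cubically with the number of points; in practice one would typically restrict $j$ and $k$ to nearby points. The helper name `build_triplets` is hypothetical.

```python
def build_triplets(labels):
    """Return index triplets (i, j, k) with labels[i] == labels[j] != labels[k] and i != j."""
    triplets = []
    n = len(labels)
    for i in range(n):
        for j in range(n):
            if j == i or labels[j] != labels[i]:
                continue  # j must be a different point of the same class
            for k in range(n):
                if labels[k] != labels[i]:
                    triplets.append((i, j, k))  # encodes dist_ij < dist_ik
    return triplets

labels = ["cat", "cat", "dog"]
triplets = build_triplets(labels)
```

Each returned triplet stands for one element of the set in Equation (2).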

The problem of learning a p.s.d. matrix such as $\mathbf{X}$ can be formulated in terms of estimating a
projection matrix $\mathbf{L}$ where $\mathbf{X} = \mathbf{L}\mathbf{L}^\top$. This approach has the advantage that the p.s.d. constraint
is enforced through the parameterization, but the disadvantage that the relationship between the
distance measure and the parameter matrix is less direct. In practice, however, this approach has led to
locally, rather than globally, optimal solutions (see Goldberger et al., 2004, for example).

Methods such as (Xing et al., 2002; Weinberger et al., 2005; Weinberger and Saul, 2006; Glober- 
son and Roweis, 2005) which seek X directly are able to guarantee global optimality, but at the cost 
of a heavy computational burden and poor scalability as it is not trivial to preserve the semidefiniteness
of $\mathbf{X}$ during the course of learning. Standard approaches such as interior-point (IP) Newton
methods need to calculate the Hessian. This typically requires $O(D^4)$ storage and has worst-case
computational complexity of approximately $O(D^{6.5})$, where $D$ is the size of the p.s.d. matrix. This
is prohibitive for many real-world problems. An alternating projected (sub-)gradient approach is
adopted in (Weinberger et al., 2005; Xing et al., 2002; Globerson and Roweis, 2005). The disadvantages
of this algorithm, however, are: 1) it is not easy to implement; 2) many parameters are
involved; 3) it usually converges slowly.

We propose here a method for learning a p.s.d. matrix, labeled BoostMetric. The method
is based on the observation that any positive semidefinite matrix can be decomposed into a positive
linear combination of trace-one rank-one matrices. The weak learner in BoostMetric is
thus a trace-one rank-one p.s.d. matrix. The proposed BoostMetric algorithm has the following
desirable properties:

1. BoostMetric is efficient and scalable. Unlike most existing methods, no semidefinite pro- 
gramming is required. At each iteration, only the largest eigenvalue and its corresponding 
eigenvector are needed. 

2. BoostMetric can accommodate various types of constraints. We demonstrate the use of
the method to learn a Mahalanobis distance on the basis of a set of proximity comparison
constraints.

3. Like AdaBoost, BoostMetric does not have any parameter to tune. The user only needs to
know when to stop. Also like AdaBoost, it is easy to implement. No sophisticated optimization
techniques are involved. The efficacy and efficiency of the proposed BoostMetric are
demonstrated on various datasets.

4. We also propose a totally-corrective version of BoostMetric. As in TotalBoost (Warmuth
et al., 2006), the weights of all the selected weak learners (rank-one matrices) are updated at
each iteration.

Both the stage-wise BoostMetric and totally-corrective BoostMetric methods are very 
easy to implement. 

The primary contributions of this work are therefore as follows: 1) We extend traditional boost- 
ing algorithms such that each weak learner is a matrix with the trace and rank of one — which must 
be positive semidefinite — rather than a classifier or regressor; 2) The proposed algorithm can be 
used to solve many semidefinite optimization problems in machine learning and computer vision. 
We demonstrate the scalability and effectiveness of our algorithms on metric learning. Part of this 
work appeared in Shen et al. (2008, 2009). More theoretical analysis and experiments are included 
in this version. Next, we review some relevant work before we present our algorithms. 

1.1 Related Work 

Distance metric learning is closely related to subspace methods. Principal component analysis 
(PCA) and linear discriminant analysis (LDA) are two classical dimensionality reduction tech- 
niques. PCA finds the subspace that captures the maximum variance within the input data while 
LDA tries to identify the projection which maximizes the between-class distance and minimizes the 
within-class variance. Locality preserving projection (LPP) finds a linear projection that preserves
the neighborhood structure of the data set (He et al., 2005). Essentially, LPP linearly approximates 
the eigenfunctions of the Laplace Beltrami operator on the underlying manifold. The connection 
between LPP and LDA is also revealed in (He et al., 2005). Wang et al. (2010a) extended LPP to 
supervised multi-label classification. Relevant component analysis (RCA) (Bar-Hillel et al., 2005) 
learns a metric from equivalence constraints. RCA can be viewed as extending LDA by incorporating
must-link constraints and cannot-link constraints into the learning procedure. Each of these
methods may be seen as devising a linear projection from the input space to a lower-dimensional
output space. If this projection is characterized by the matrix $\mathbf{L}$, then note that these methods may
be related to the problem of interest here by observing $\mathbf{X} = \mathbf{L}\mathbf{L}^\top$. This typically implies that $\mathbf{X}$ is
rank-deficient.

Recently, there has been significant research interest in supervised distance metric learning using
side information that is typically presented as a set of pairwise constraints. Most of these methods,
although appearing in different formats, share a similar essential idea: to learn an optimal distance
metric by keeping training examples in equivalence constraints close and, at the same time,
examples in in-equivalence constraints well separated. Previous work (Xing et al., 2002; Weinberger
et al., 2005; Jian and Vemuri, 2007; Goldberger et al., 2004; Bar-Hillel et al., 2005; Schultz
and Joachims, 2003) falls into this category. The requirement that $\mathbf{X}$ must be p.s.d. has led to the
development of a number of methods for learning a Mahalanobis distance which rely upon constrained
semidefinite programming. This approach has a number of limitations, however, which we
now discuss with reference to the problem of learning a p.s.d. matrix from a set of constraints upon
pairwise-distance comparisons. Relevant work on this topic includes (Bar-Hillel et al., 2005; Xing
et al., 2002; Jian and Vemuri, 2007; Goldberger et al., 2004; Weinberger et al., 2005; Globerson and
Roweis, 2005) amongst others.

Xing et al. (2002) first proposed the idea of learning a Mahalanobis metric for clustering using 
convex optimization. The inputs are two sets: a similarity set and a dis-similarity set. The algorithm 
maximizes the distance between points in the dis-similarity set under the constraint that the distance 
between points in the similarity set is upper-bounded. Neighborhood component analysis (NCA) 
(Goldberger et al., 2004) and large margin nearest neighbor (LMNN) (Weinberger et al., 2005) learn 
a metric by maintaining consistency in the data's neighborhood and keeping a large margin at the boundaries
of different classes. It has been shown in (Weinberger and Saul, 2009; Weinberger et al., 2005)
that LMNN delivers state-of-the-art performance among most distance metric learning algorithms.
Information theoretic metric learning (ITML) learns a suitable metric based on information
theory (Davis et al., 2007). To partially alleviate the heavy computation of standard IP Newton
methods, Bregman's cyclic projection is used in Davis et al. (2007). This idea is extended in Wang 
and Jin (2009), which has a closed-form solution and is computationally efficient. 

There have been a number of approaches developed which aim to improve the scalability of
the process of learning a metric parameterized by a p.s.d. matrix $\mathbf{X}$. For example, Rosales and Fung
(2006) approximate the p.s.d. cone using a set of linear constraints based on the diagonal dominance
theorem. The approximation is not accurate, however, in the sense that it imposes too strong a condition
on the learned matrix: one may not want to learn a diagonally dominant matrix. Alternating
optimization is used in (Xing et al., 2002; Weinberger et al., 2005) to solve the semidefinite problem
iteratively. At each iteration, a full eigen-decomposition is applied to project the solution back onto
the p.s.d. cone. BoostMetric is conceptually very different to this approach, and additionally
only requires the calculation of the first eigenvector. Tsuda et al. (2005) proposed to use matrix
logarithms and exponentials to preserve positive definiteness. For the application of semidefinite
kernel learning, they designed a matrix exponentiated gradient method to optimize von Neumann
divergence based objective functions. At each iteration of matrix exponentiated gradient, a full
eigen-decomposition is needed. In contrast, we only need to find the leading eigenvector.
The approach proposed here is directly inspired by the LMNN proposed in (Weinberger and
Saul, 2009; Weinberger et al., 2005). Instead of using the hinge loss, however, we use the ex- 
ponential loss and logistic loss functions in order to derive an AdaBoost-like (or LogitBoost-like) 
optimization procedure. In theory, any differentiable convex loss function can be applied here. 
Hence, despite similar purposes, our algorithm differs essentially in the optimization. While the 
formulation of LMNN looks more similar to SVMs, our algorithm, termed BoostMetric, largely 
draws upon AdaBoost (Schapire, 1999). 

Column generation was first proposed by Dantzig and Wolfe (1960) for solving a particular
form of structured linear program with an extremely large number of variables. The general idea 
of column generation is that, instead of solving the original large-scale problem (master problem), 
one works on a restricted master problem with a reasonably small subset of the variables at each 
step. The dual of the restricted master problem is solved by the simplex method, and the optimal 
dual solution is used to find the new column to be included into the restricted master problem. LP- 
Boost (Demiriz et al., 2002) is a direct application of column generation in boosting. Significantly, 
LPBoost showed that in an LP framework, unknown weak hypotheses can be learned from the dual 
although the space of all weak hypotheses is infinitely large. Shen and Li (2010) applied column
generation to boosting with general loss functions. It is these results that underpin BoostMetric. 

The remaining content is organized as follows. In Section 2 we present some preliminary math- 
ematics. In Section 3, we show the main results. Experimental results are provided in Section 
4. 

2. Preliminaries 

We introduce some fundamental concepts that are necessary for setting up our problem. First, the
notation used in this paper is as follows. 

2.1 Notation 

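Throughout this paper, a matrix is denoted by a bold upper-case letter ($\mathbf{X}$); a column vector is
denoted by a bold lower-case letter ($\mathbf{x}$). The $i$th row of $\mathbf{X}$ is denoted by $\mathbf{X}_{i:}$ and the $i$th column by $\mathbf{X}_{:i}$.
$\mathbf{1}$ and $\mathbf{0}$ are column vectors of 1's and 0's, respectively. Their size should be clear from the context.
We denote the space of $D \times D$ symmetric matrices by $\mathbb{S}^D$, and positive semidefinite matrices by $\mathbb{S}^D_+$.
$\operatorname{Tr}(\cdot)$ is the trace of a symmetric matrix and $\langle \mathbf{X}, \mathbf{Z} \rangle = \operatorname{Tr}(\mathbf{X}\mathbf{Z}^\top) = \sum_{ij} X_{ij} Z_{ij}$ calculates the inner
product of two matrices. An element-wise inequality between two vectors like $\mathbf{u} \le \mathbf{v}$ means $u_i \le v_i$
for all $i$. We use $\mathbf{X} \succeq 0$ to indicate that matrix $\mathbf{X}$ is positive semidefinite. For a matrix $\mathbf{X} \in \mathbb{S}^D$, the
following statements are equivalent: 1) $\mathbf{X} \succeq 0$ ($\mathbf{X} \in \mathbb{S}^D_+$); 2) All eigenvalues of $\mathbf{X}$ are nonnegative
($\lambda_i(\mathbf{X}) \ge 0$, $i = 1, \cdots, D$); and 3) $\forall \mathbf{u} \in \mathbb{R}^D$, $\mathbf{u}^\top \mathbf{X} \mathbf{u} \ge 0$.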

2.2 A Theorem on Trace-one Semidefinite Matrices 

Before we present our main results, we introduce an important theorem that serves as the theoretical
basis of BoostMetric.

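Definition 2.1. For any positive integer $m$, given a set of points $\{\mathbf{x}_1, \ldots, \mathbf{x}_m\}$ in a real vector or matrix
space Sp, the convex hull of Sp spanned by $m$ elements in Sp is defined as:

$$\mathrm{Conv}_m(\mathrm{Sp}) = \Big\{ \sum_{i=1}^{m} w_i \mathbf{x}_i \ \Big|\ w_i \ge 0, \ \sum_{i=1}^{m} w_i = 1, \ \mathbf{x}_i \in \mathrm{Sp} \Big\}.$$

Define the linear convex span of Sp as:

$$\mathrm{Conv}(\mathrm{Sp}) = \bigcup_{m} \mathrm{Conv}_m(\mathrm{Sp}) = \Big\{ \sum_{i=1}^{m} w_i \mathbf{x}_i \ \Big|\ w_i \ge 0, \ \sum_{i=1}^{m} w_i = 1, \ \mathbf{x}_i \in \mathrm{Sp}, \ m \in \mathbb{Z}_+ \Big\}.$$

Here $\mathbb{Z}_+$ denotes the set of all positive integers.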

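Definition 2.2. Let us define $\Gamma_1$ to be the space of all positive semidefinite matrices $\mathbf{X} \in \mathbb{S}^D_+$ with
trace equaling one:

$$\Gamma_1 = \{\mathbf{X} \mid \mathbf{X} \succeq 0, \ \operatorname{Tr}(\mathbf{X}) = 1\};$$

and $\Psi_1$ to be the space of all positive semidefinite matrices with both trace and rank equaling one:

$$\Psi_1 = \{\mathbf{Z} \mid \mathbf{Z} \succeq 0, \ \operatorname{Tr}(\mathbf{Z}) = 1, \ \operatorname{Rank}(\mathbf{Z}) = 1\}.$$

We also define $\Gamma_2$ as the convex hull of $\Psi_1$, i.e.,

$$\Gamma_2 = \mathrm{Conv}(\Psi_1).$$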

Lemma 2.1. Let $\Psi_2$ be a convex polytope defined as $\Psi_2 = \{\boldsymbol{\lambda} \in \mathbb{R}^D \mid \lambda_k \ge 0, \ \forall k = 1, \cdots, D, \ \sum_{k=1}^{D} \lambda_k = 1\}$,
then the points with only one element equaling one and all the others being zeros are the extreme
points (vertexes) of $\Psi_2$. All the other points can not be extreme points.

Proof. Without loss of generality, let us consider such a point $\boldsymbol{\lambda}' = \{1, 0, \cdots, 0\}$. If $\boldsymbol{\lambda}'$ is not an
extreme point of $\Psi_2$, then it must be possible to express it as a convex combination of a set of
other points in $\Psi_2$: $\boldsymbol{\lambda}' = \sum_{i=1}^{m} w_i \boldsymbol{\lambda}^i$, $w_i > 0$, $\sum_{i=1}^{m} w_i = 1$ and $\boldsymbol{\lambda}^i \ne \boldsymbol{\lambda}'$. Then we have equations:
$\sum_{i=1}^{m} w_i \lambda_k^i = 0$, $\forall k = 2, \cdots, D$. It follows that $\lambda_k^i = 0$, $\forall i$ and $k = 2, \cdots, D$. That means, $\lambda_1^i = 1$ $\forall i$.
This is inconsistent with $\boldsymbol{\lambda}^i \ne \boldsymbol{\lambda}'$. Therefore such a convex combination does not exist and $\boldsymbol{\lambda}'$ must
be an extreme point. It is trivial to see that any $\boldsymbol{\lambda}$ that has more than one active element is a convex
combination of the above-defined extreme points. So they can not be extreme points. ■

Theorem 2.1. $\Gamma_1$ equals $\Gamma_2$; i.e., $\Gamma_1$ is also the convex hull of $\Psi_1$. In other words, all $\mathbf{Z} \in \Psi_1$
form the set of extreme points of $\Gamma_1$.

Proof. It is easy to check that any convex combination $\sum_i w_i \mathbf{Z}_i$, such that $\mathbf{Z}_i \in \Psi_1$, resides in $\Gamma_1$,
with the following two facts: 1) a convex combination of p.s.d. matrices is still a p.s.d. matrix; 2)
$\operatorname{Tr}(\sum_i w_i \mathbf{Z}_i) = \sum_i w_i \operatorname{Tr}(\mathbf{Z}_i) = 1$.

By denoting $\lambda_1 \ge \cdots \ge \lambda_D \ge 0$ the eigenvalues of a $\mathbf{Z} \in \Gamma_1$, we know that $\lambda_1 \le 1$ because
$\sum_{i=1}^{D} \lambda_i = \operatorname{Tr}(\mathbf{Z}) = 1$. Therefore, all eigenvalues of $\mathbf{Z}$ must satisfy: $\lambda_i \in [0, 1]$, $\forall i = 1, \cdots, D$ and
$\sum_{i}^{D} \lambda_i = 1$. By looking at the eigenvalues of $\mathbf{Z}$ and using Lemma 2.1, it is immediate to see that a
matrix $\mathbf{Z}$ such that $\mathbf{Z} \succeq 0$, $\operatorname{Tr}(\mathbf{Z}) = 1$ and $\operatorname{Rank}(\mathbf{Z}) > 1$ can not be an extreme point of $\Gamma_1$. The only
candidates for extreme points are those rank-one matrices ($\lambda_1 = 1$ and $\lambda_{2, \cdots, D} = 0$). Moreover, it is
not possible that some rank-one matrices are extreme points and others are not because the other
two constraints $\mathbf{Z} \succeq 0$ and $\operatorname{Tr}(\mathbf{Z}) = 1$ do not distinguish between different rank-one matrices.

Hence, all $\mathbf{Z} \in \Psi_1$ form the set of extreme points of $\Gamma_1$. Furthermore, $\Gamma_1$ is a convex and compact
set, which must have extreme points. The Krein-Milman Theorem (Krein and Milman, 1940) tells
us that a convex and compact set is equal to the convex hull of its extreme points. ■

1. With slight abuse of notation, we also use the symbol Conv(·) to denote convex span. In general it is not a convex
hull.
This theorem is a special case of the results from (Overton and Womersley, 1992) in the context
of eigenvalue optimization. A different proof for the above theorem's general version can also be 
found in (Fillmore and Williams, 1971). 
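
Theorem 2.1 can be checked numerically in the smallest nontrivial case. The sketch below, which assumes a symmetric $2 \times 2$ input and uses the closed-form eigenpairs of such a matrix, writes a trace-one p.s.d. matrix as a convex combination of trace-one rank-one matrices. The function `decompose_2x2` and the example matrix are illustrative assumptions, not part of the original algorithm.

```python
import math

def decompose_2x2(X):
    """Split a symmetric 2x2 matrix into [(weight, rank_one_matrix), ...] via its eigenpairs."""
    a, b, c = X[0][0], X[0][1], X[1][1]
    if b == 0.0:
        pairs = [(a, (1.0, 0.0)), (c, (0.0, 1.0))]
    else:
        mean, radius = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
        pairs = []
        for lam in (mean + radius, mean - radius):
            vx, vy = lam - c, b               # eigenvector of [[a, b], [b, c]] for lam
            norm = math.hypot(vx, vy)
            pairs.append((lam, (vx / norm, vy / norm)))
    # Each Z = v v^T has rank one and trace |v|^2 = 1.
    return [(lam, [[v[0] * v[0], v[0] * v[1]], [v[1] * v[0], v[1] * v[1]]]) for lam, v in pairs]

X = [[0.7, 0.2], [0.2, 0.3]]                  # p.s.d. with trace one
parts = decompose_2x2(X)
weights = [w for w, _ in parts]
rebuilt = [[sum(w * Z[r][c] for w, Z in parts) for c in range(2)] for r in range(2)]
```

The weights are the eigenvalues of the input; for a trace-one p.s.d. matrix they are nonnegative and sum to one, so the combination is convex.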

In the context of semidefinite optimization, what is of interest about Theorem 2.1 is as follows:
it tells us that a bounded p.s.d. matrix constraint $\mathbf{X} \in \Gamma_1$ can be equivalently replaced with a set of
constraints which belong to $\Gamma_2$. At first glance, this is a highly counterintuitive proposition because
$\Gamma_2$ involves many more complicated constraints. Both $w_i$ and $\mathbf{Z}_i$ ($\forall i = 1, \cdots, m$) are unknown
variables. Even worse, $m$ could be extremely (or even infinitely) large. Nevertheless, this is the type
of problem that boosting algorithms are designed to solve. Let us give a brief overview of boosting
algorithms.

2.3 Boosting 

Boosting is an example of ensemble learning, where multiple learners are trained to solve the same 
problem. Typically a boosting algorithm (Schapire, 1999) creates a single strong learner by incrementally
adding base (weak) learners to the final strong learner. The base learner has an important
impact on the strong learner. In general, a boosting algorithm builds on a user-specified base learn- 
ing procedure and runs it repeatedly on modified data that are outputs from the previous iterations. 

The general form of the boosting algorithm is sketched in Algorithm 1. The inputs to a boosting
algorithm are a set of training examples $\mathbf{x}$ and their corresponding class labels $y$. The final output is
a strong classifier which takes the form

$$F_{\mathbf{w}}(\mathbf{x}) = \sum_{j=1}^{J} w_j h_j(\mathbf{x}). \qquad (3)$$

Here $h_j(\cdot)$ is a base learner. From Theorem 2.1, we know that a matrix $\mathbf{X} \in \Gamma_1$ can be decomposed
as

$$\mathbf{X} = \sum_{j=1}^{J} w_j \mathbf{Z}_j, \quad \mathbf{Z}_j \in \Psi_1. \qquad (4)$$

By observing the similarity between Equations (3) and (4), we may view $\mathbf{Z}_j$ as a weak classifier
and the matrix $\mathbf{X}$ as the strong classifier that we want to learn. This is exactly the problem that
boosting methods have been designed to solve. This observation inspires us to solve a special type
of semidefinite optimization problem using boosting techniques.

The sparse greedy approximation algorithm proposed by Zhang (2003) is an efficient method for
solving a class of convex problems, and achieves fast convergence rates. It has also been shown that
boosting algorithms can be interpreted within the general framework of (Zhang, 2003). The main
idea of sequential greedy approximation is as follows. Given an initialization $\mathbf{u}_0$, which
is in a convex subset of a linear vector space, a matrix space or a functional space, the algorithm
finds $\bar{\mathbf{u}}_i$ and $\lambda \in (0, 1)$ such that the objective function $F((1 - \lambda)\mathbf{u}_{i-1} + \lambda \bar{\mathbf{u}}_i)$ is minimized. Then the
solution $\mathbf{u}_i$ is updated as $\mathbf{u}_i = (1 - \lambda)\mathbf{u}_{i-1} + \lambda \bar{\mathbf{u}}_i$ and the iteration goes on. Clearly, $\mathbf{u}_i$ must remain in
the original space. As shown next, our first case, which learns a metric using the hinge loss, greatly
resembles this idea.

2.4 Distance Metric Learning Using Proximity Comparison 

The process of measuring distance using a Mahalanobis metric is equivalent to linearly transforming
the data by a projection matrix $\mathbf{L} \in \mathbb{R}^{D \times d}$ (usually $D \ge d$) before calculating the standard Euclidean
distance:

$$\mathrm{dist}_{ij}^2 = \|\mathbf{L}^\top \mathbf{a}_i - \mathbf{L}^\top \mathbf{a}_j\|_2^2 = (\mathbf{a}_i - \mathbf{a}_j)^\top \mathbf{L}\mathbf{L}^\top (\mathbf{a}_i - \mathbf{a}_j) = (\mathbf{a}_i - \mathbf{a}_j)^\top \mathbf{X} (\mathbf{a}_i - \mathbf{a}_j). \qquad (5)$$

Algorithm 1 The general framework of boosting.

Input: Training data.
1 Initialize a weight set $\mathbf{u}$ on the training examples;
2 for $j = 1, 2, \cdots$, do
3     Receive a weak hypothesis $h_j(\cdot)$;
4     Calculate $w_j > 0$;
5     Update $\mathbf{u}$.
Output: A convex combination of the weak hypotheses: $F_{\mathbf{w}}(\mathbf{x}) = \sum_{j=1}^{J} w_j h_j(\mathbf{x})$.

As described above, the problem of learning a Mahalanobis metric can be approached in terms 
of learning the matrix L, or the p.s.d. matrix X. If X = I, the Mahalanobis distance reduces to the 
Euclidean distance. If X is diagonal, the problem corresponds to learning a metric in which different 
features are given different weights, a.k.a., feature weighting. Our approach is to learn a full p.s.d. 
matrix X, however, using BoostMetric. 
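
The identity in Equation (5) is easy to verify numerically. The following sketch uses a small hypothetical projection matrix with $D = 3$ and $d = 2$; the helper names `project`, `gram` and `quad_form` are our own and are only meant to illustrate the two equivalent ways of computing the squared distance.

```python
def project(L, a):
    """Compute L^T a for a D x d matrix L stored as a list of D rows."""
    return [sum(L[r][c] * a[r] for r in range(len(L))) for c in range(len(L[0]))]

def gram(L):
    """Compute X = L L^T, a D x D p.s.d. matrix."""
    D = len(L)
    return [[sum(L[r][k] * L[c][k] for k in range(len(L[0]))) for c in range(D)] for r in range(D)]

def quad_form(X, v):
    """Compute v^T X v."""
    return sum(v[r] * X[r][c] * v[c] for r in range(len(v)) for c in range(len(v)))

L = [[1.0, 0.0], [2.0, 1.0], [0.0, -1.0]]     # hypothetical projection, D = 3, d = 2
a_i, a_j = [1.0, 2.0, 3.0], [0.0, 1.0, 5.0]
diff = [p - q for p, q in zip(a_i, a_j)]

# Left: Euclidean distance after projection. Right: quadratic form with X = L L^T.
lhs = sum((p - q) ** 2 for p, q in zip(project(L, a_i), project(L, a_j)))
rhs = quad_form(gram(L), diff)
```

Both routes give the same squared distance; since $d < D$ here, the resulting $\mathbf{X}$ is rank-deficient.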

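In the framework of large-margin learning, we want to maximize the difference between $\mathrm{dist}_{ij}$
and $\mathrm{dist}_{ik}$. That is, we wish to make $\mathrm{dist}_{ik}^2 - \mathrm{dist}_{ij}^2 = (\mathbf{a}_i - \mathbf{a}_k)^\top \mathbf{X} (\mathbf{a}_i - \mathbf{a}_k) - (\mathbf{a}_i - \mathbf{a}_j)^\top \mathbf{X} (\mathbf{a}_i - \mathbf{a}_j)$ as
large as possible under some regularization. To simplify notation, we rewrite the difference between
$\mathrm{dist}_{ij}^2$ and $\mathrm{dist}_{ik}^2$ as $\mathrm{dist}_{ik}^2 - \mathrm{dist}_{ij}^2 = \langle \mathbf{A}_r, \mathbf{X} \rangle$, where

$$\mathbf{A}_r = (\mathbf{a}_i - \mathbf{a}_k)(\mathbf{a}_i - \mathbf{a}_k)^\top - (\mathbf{a}_i - \mathbf{a}_j)(\mathbf{a}_i - \mathbf{a}_j)^\top, \qquad (6)$$

for $r = 1, \cdots, |\mathcal{I}|$ and $|\mathcal{I}|$ is the size of the set of constraints $\mathcal{I}$ defined in Equation (2).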
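
To make Equation (6) concrete, the sketch below builds the constraint matrix for one triplet and confirms that its trace inner product with a metric matrix equals the difference of the two squared distances. The function names and the example points are hypothetical.

```python
def outer(v):
    """Outer product v v^T."""
    return [[x * y for y in v] for x in v]

def constraint_matrix(a_i, a_j, a_k):
    """A_r = (a_i - a_k)(a_i - a_k)^T - (a_i - a_j)(a_i - a_j)^T, as in Equation (6)."""
    far = outer([p - q for p, q in zip(a_i, a_k)])
    near = outer([p - q for p, q in zip(a_i, a_j)])
    return [[f - n for f, n in zip(fr, nr)] for fr, nr in zip(far, near)]

def inner(A, X):
    """Trace inner product <A, X> = sum_ij A_ij X_ij."""
    return sum(a * x for ar, xr in zip(A, X) for a, x in zip(ar, xr))

def sq_dist(p, q, X):
    """Squared Mahalanobis distance (p - q)^T X (p - q)."""
    d = [x - y for x, y in zip(p, q)]
    return sum(d[r] * X[r][c] * d[c] for r in range(len(d)) for c in range(len(d)))

a_i, a_j, a_k = [0.0, 0.0], [1.0, 0.0], [0.0, 3.0]
X = [[2.0, 0.0], [0.0, 1.0]]                   # hypothetical p.s.d. metric
A_r = constraint_matrix(a_i, a_j, a_k)
margin = inner(A_r, X)
gap = sq_dist(a_i, a_k, X) - sq_dist(a_i, a_j, X)
```

A positive value of `margin` means the triplet constraint is satisfied under the metric `X`.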

3. Algorithms 

In this section, we define the optimization problems for metric learning. We mainly investigate the 
cases using the hinge loss, exponential loss and logistic loss functions. In order to derive an efficient 
optimization strategy, we look at their Lagrange dual problems and design boosting-like approaches 
for efficiency. 

3.1 Learning with the Hinge Loss 

Our goal is to derive a general algorithm for p.s.d. matrix learning with the hinge loss function.
Assume that we want to find a p.s.d. matrix $\mathbf{X} \succeq 0$ such that a set of constraints

$$\langle \mathbf{A}_r, \mathbf{X} \rangle > 0, \quad r = 1, 2, \cdots,$$

are satisfied as well as possible. Here $\mathbf{A}_r$ is as defined in (6). These constraints need not all be
strictly satisfied and thus we define the margin $\rho_r = \langle \mathbf{A}_r, \mathbf{X} \rangle$, $\forall r$.

Putting it into the maximum margin learning framework, we want to minimize the following
trace norm regularized objective function: $\sum_r F(\langle \mathbf{A}_r, \mathbf{X} \rangle) + v \operatorname{Tr}(\mathbf{X})$, with $F(\cdot)$ a convex loss function
and $v$ a regularization constant. Here we have used the trace norm regularization. Of course a
Frobenius norm regularization term can also be used here. Minimizing the Frobenius norm $\|\mathbf{X}\|_F$,
which is equivalent to minimizing the $\ell_2$ norm of the eigenvalues of $\mathbf{X}$, penalizes a solution that is far
away from the identity matrix. With the hinge loss, we can write the optimization problem as:

$$\max_{\rho, \mathbf{X}, \boldsymbol{\xi}} \ \rho - v \sum_{r=1}^{|\mathcal{I}|} \xi_r, \quad \text{s.t.: } \langle \mathbf{A}_r, \mathbf{X} \rangle \ge \rho - \xi_r, \ \forall r; \ \mathbf{X} \succeq 0, \ \operatorname{Tr}(\mathbf{X}) = 1; \ \boldsymbol{\xi} \ge 0. \qquad (7)$$

Here Tr(X) = 1 removes the scale ambiguity because the distance inequalities are scale invariant. 

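We can decompose $\mathbf{X}$ into: $\mathbf{X} = \sum_{j=1}^{J} w_j \mathbf{Z}_j$, with $w_j > 0$, $\operatorname{Rank}(\mathbf{Z}_j) = 1$ and $\operatorname{Tr}(\mathbf{Z}_j) = 1$, $\forall j$.
So we have

$$\langle \mathbf{A}_r, \mathbf{X} \rangle = \Big\langle \mathbf{A}_r, \sum_{j=1}^{J} w_j \mathbf{Z}_j \Big\rangle = \sum_{j=1}^{J} w_j \langle \mathbf{A}_r, \mathbf{Z}_j \rangle = \sum_{j=1}^{J} w_j H_{rj} = \mathbf{H}_{r:} \mathbf{w}, \quad \forall r. \qquad (8)$$

Here $H_{rj}$ is a shorthand for $H_{rj} = \langle \mathbf{A}_r, \mathbf{Z}_j \rangle$. Clearly, $\operatorname{Tr}(\mathbf{X}) = \mathbf{1}^\top \mathbf{w}$. Using Theorem 2.1, we replace
the p.s.d. conic constraint in the primal (7) with a linear convex combination of rank-one trace-one
matrices: $\mathbf{X} = \sum_j w_j \mathbf{Z}_j$, and $\mathbf{1}^\top \mathbf{w} = 1$. Substituting $\mathbf{X}$ in (7), we have

$$\max_{\rho, \mathbf{w}, \boldsymbol{\xi}} \ \rho - v \sum_{r=1}^{|\mathcal{I}|} \xi_r, \quad \text{s.t.: } \mathbf{H}_{r:} \mathbf{w} \ge \rho - \xi_r \ (r = 1, \ldots, |\mathcal{I}|); \ \mathbf{w} \ge 0, \ \mathbf{1}^\top \mathbf{w} = 1; \ \boldsymbol{\xi} \ge 0. \qquad (9)$$

The Lagrange dual problem of the above linear programming problem (9) is easily derived:

$$\min_{\pi, \mathbf{u}} \ \pi \quad \text{s.t.: } \sum_{r=1}^{|\mathcal{I}|} u_r \mathbf{H}_{r:} \le \pi \mathbf{1}^\top; \ \mathbf{1}^\top \mathbf{u} = 1, \ 0 \le \mathbf{u} \le v \mathbf{1}. \qquad (10)$$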

We can then use column generation to solve the original problem iteratively by looking at both the 
primal and dual problems. See Shen et al. (2008) for the algorithmic details. In this work we are 
more interested in smooth loss functions such as the exponential loss and logistic loss, as presented 
in the sequel. 

3.2 Learning with the Exponential Loss 

By employing the exponential loss, we want to optimize 

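$$\min_{\mathbf{X}, \boldsymbol{\rho}} \ \log\Big(\sum_{r=1}^{|\mathcal{I}|} \exp(-\rho_r)\Big) + v \operatorname{Tr}(\mathbf{X})$$

$$\text{s.t.: } \rho_r = \langle \mathbf{A}_r, \mathbf{X} \rangle, \ r = 1, \cdots, |\mathcal{I}|, \ \mathbf{X} \succeq 0. \qquad (11)$$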

Note that: 1) We are proposing a logarithmic version of the sum of exponential loss. This transform 
does not change the original optimization problem of sum of exponential loss because the logarith- 
mic function is strictly monotonically increasing. 2) A regularization term Tr(X) has been applied. 
Without this regularization, one can always multiply X by an arbitrarily large scale factor in order 
to make the exponential loss approach zero in the case of all constraints being satisfied. This trace-norm
regularization may also lead to low-rank solutions. 3) An auxiliary variable $\rho_r$, $r = 1, \ldots$, must
be introduced for deriving a meaningful dual problem, as we show later.
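
The role of the trace regularizer in (11) can be seen in a small numerical sketch. The code below evaluates the objective from a list of margins using the standard log-sum-exp shift for numerical stability (our choice, not something prescribed by the derivation), and compares the objective before and after scaling a hypothetical solution by a factor of ten. The function name and the numbers are illustrative assumptions.

```python
import math

def exp_loss_objective(margins, trace_X, v):
    """log(sum_r exp(-rho_r)) + v * Tr(X), with a max-shift to avoid overflow."""
    shift = max(-m for m in margins)
    lse = shift + math.log(sum(math.exp(-m - shift) for m in margins))
    return lse + v * trace_X

margins = [1.0, 2.0]      # hypothetical margins <A_r, X>; all constraints satisfied
scaled_margins = [10.0 * m for m in margins]   # margins after replacing X by 10 X

# Without regularization (v = 0), scaling X up always lowers the objective.
plain = exp_loss_objective(margins, trace_X=1.0, v=0.0)
plain_scaled = exp_loss_objective(scaled_margins, trace_X=10.0, v=0.0)

# With v = 2 the trace penalty outweighs the gain from scaling in this example.
reg = exp_loss_objective(margins, trace_X=1.0, v=2.0)
reg_scaled = exp_loss_objective(scaled_margins, trace_X=10.0, v=2.0)
```

In this example the unregularized objective keeps decreasing as the solution is scaled, whereas the regularized one increases, which is the scale ambiguity that note 2) above refers to.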

We now derive the Lagrange dual of the problem that we are interested in. The original problem 
(11) now becomes 

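$$\min_{\boldsymbol{\rho}, \mathbf{w}} \ \log\Big(\sum_{r=1}^{|\mathcal{I}|} \exp(-\rho_r)\Big) + v \mathbf{1}^\top \mathbf{w}$$

$$\text{s.t.: } \rho_r = \mathbf{H}_{r:} \mathbf{w}, \ r = 1, \cdots, |\mathcal{I}|; \ \mathbf{w} \ge 0. \qquad (12)$$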
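We have used Equation (8). In order to derive its dual, we write its Lagrangian

$$L(\mathbf{w}, \boldsymbol{\rho}, \mathbf{u}) = \log\Big(\sum_{r=1}^{|\mathcal{I}|} \exp(-\rho_r)\Big) + v \mathbf{1}^\top \mathbf{w} + \sum_{r=1}^{|\mathcal{I}|} u_r (\rho_r - \mathbf{H}_{r:} \mathbf{w}) - \mathbf{p}^\top \mathbf{w}, \qquad (13)$$

with $\mathbf{p} \ge 0$. The dual problem is obtained by finding the saddle point of $L$; i.e., $\sup_{\mathbf{u}} \inf_{\mathbf{w}, \boldsymbol{\rho}} L$.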

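$$\inf_{\mathbf{w}, \boldsymbol{\rho}} L = \underbrace{\inf_{\boldsymbol{\rho}} \ \log\Big(\sum_{r=1}^{|\mathcal{I}|} \exp(-\rho_r)\Big) + \mathbf{u}^\top \boldsymbol{\rho}}_{L_1} + \underbrace{\inf_{\mathbf{w}} \ \Big(v \mathbf{1}^\top - \sum_{r=1}^{|\mathcal{I}|} u_r \mathbf{H}_{r:} - \mathbf{p}^\top\Big) \mathbf{w}}_{L_2} \qquad (14)$$

$$= -\sum_{r=1}^{|\mathcal{I}|} u_r \log u_r. \qquad (15)$$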

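The infimum of $L_1$ is found by setting its first derivative to zero and we have:

$$\inf_{\boldsymbol{\rho}} L_1 = \begin{cases} -\sum_r u_r \log u_r & \text{if } \mathbf{u} \ge 0, \ \mathbf{1}^\top \mathbf{u} = 1, \\ -\infty & \text{otherwise.} \end{cases} \qquad (16)$$

The infimum is Shannon entropy. $L_2$ is linear in $\mathbf{w}$, hence it must be 0. It leads to

$$\sum_{r=1}^{|\mathcal{I}|} u_r \mathbf{H}_{r:} \le v \mathbf{1}^\top. \qquad (17)$$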

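The Lagrange dual problem of (12) is an entropy maximization problem, which writes

$$\max_{\mathbf{u}} \ -\sum_{r=1}^{|\mathcal{I}|} u_r \log u_r, \quad \text{s.t.: } \mathbf{u} \ge 0, \ \mathbf{1}^\top \mathbf{u} = 1, \ \text{and (17)}. \qquad (18)$$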

Weak and strong duality hold under mild conditions (Boyd and Vandenberghe, 2004). That means,
one can usually solve one problem from the other. The KKT conditions link the optimal solutions of
these two problems. In our case, it is

$$u_r^\star = \frac{\exp(-\rho_r^\star)}{\sum_{k=1}^{|\mathcal{I}|} \exp(-\rho_k^\star)}, \quad \forall r. \qquad (19)$$
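
Equation (19) is a softmax over the negated margins. A minimal Python sketch, assuming the margins are given as a plain list and using a max-shift for numerical stability (an implementation choice of ours), is shown below; the name `dual_weights` is hypothetical.

```python
import math

def dual_weights(margins):
    """u_r = exp(-rho_r) / sum_k exp(-rho_k), computed with a max-shift for stability."""
    shift = max(-m for m in margins)
    raw = [math.exp(-m - shift) for m in margins]
    total = sum(raw)
    return [x / total for x in raw]

# Hypothetical margins: the last constraint is violated (negative margin).
u = dual_weights([2.0, 0.0, -1.0])
```

Constraints with small or negative margins receive the largest dual weights, which mirrors the way AdaBoost emphasizes hard examples.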



While it is possible to devise a totally-corrective column generation based optimization procedure
for solving our problem, as in the case of LPBoost (Demiriz et al., 2002), we are more interested in
considering one-at-a-time coordinate-wise descent algorithms, as in the case of AdaBoost (Schapire,
1999). Let us start from some basic knowledge of column generation because our coordinate descent
strategy is inspired by column generation.

If we know all the bases $\mathbf{Z}_j$ ($j = 1, \ldots, J$), and hence the entire matrix $\mathbf{H}$, then either
the primal (12) or the dual (18) can be trivially solved (at least in theory) because both are convex
optimization problems. We can solve them in polynomial time. In particular, the primal problem is a
convex minimization with simple nonnegativity constraints. Off-the-shelf software like L-BFGS-B
(Zhu et al., 1997) can be used for this purpose. Unfortunately, in practice, we do not have access to all
the bases: the number of possible $\mathbf{Z}$ is infinite. In convex optimization, column generation is a technique
that is designed for solving this difficulty.

Column generation was originally advocated for solving large scale linear programs (Lübbecke
and Desrosiers, 2005). Column generation is based on the fact that for a linear program, the number
of non-zero variables of a basic optimal solution is at most the number of constraints. Therefore,
although the number of possible variables may be large, we only need a small subset of these in
the optimal solution. For a general convex problem, we can use column generation to obtain an
approximate solution. It works by only considering a small subset of the entire variable set. Once
it is solved, we ask the question: "Are there any other variables that can be included to improve
the solution?". So we must be able to solve the subproblem: given a set of dual values, one either
identifies a variable that has a favorable reduced cost, or indicates that such a variable does not exist.
Essentially, column generation finds the variables with negative reduced costs without explicitly
enumerating all variables.

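Instead of directly solving the primal problem (12), we iteratively find the most violated constraint in the
dual (18) for the current solution and add this constraint to the optimization problem.
For this purpose, we need to solve

$$\hat{\mathbf{Z}} = \operatorname{argmax}_{\mathbf{Z}} \Big\{ \sum_{r=1}^{|\mathcal{I}|} u_r \langle \mathbf{A}_r, \mathbf{Z} \rangle, \ \text{s.t.: } \mathbf{Z} \in \Psi_1 \Big\}. \qquad (20)$$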

We discuss how to efficiently solve (20) later. Now we move on to derive a coordinate descent 
optimization procedure. 
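
Since every $\mathbf{Z} \in \Psi_1$ can be written as $\boldsymbol{\xi}\boldsymbol{\xi}^\top$ with $\|\boldsymbol{\xi}\|_2 = 1$, the objective in (20) equals $\boldsymbol{\xi}^\top (\sum_r u_r \mathbf{A}_r) \boldsymbol{\xi}$, which is maximized by the eigenvector of the largest eigenvalue. The sketch below finds that eigenvector with a shifted power iteration. This is a simple stand-in chosen for illustration, not necessarily the solver used in our experiments; it assumes the fixed starting vector is not orthogonal to the leading eigenvector and that the eigengap is large enough for the given number of iterations. All names are hypothetical.

```python
import math

def leading_eigenpair(A, iters=500):
    """Largest algebraic eigenvalue and a unit eigenvector of a symmetric matrix A."""
    n = len(A)
    # The maximum absolute row sum bounds the spectral radius, so A + shift*I is p.s.d.
    # and power iteration on it targets the top (algebraic) eigenvector of A.
    shift = max(sum(abs(x) for x in row) for row in A)
    v = [1.0 / (i + 1) for i in range(n)]       # fixed, arbitrary start (an assumption)
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(A[r][c] * v[c] for c in range(n)) + shift * v[r] for r in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                         # (A + shift*I) v vanished; keep current v
            break
        v = [x / norm for x in w]
    lam = sum(v[r] * A[r][c] * v[c] for r in range(n) for c in range(n))
    return lam, v

def weak_learner(A_list, u):
    """Z = xi xi^T maximizing sum_r u_r <A_r, Z> over trace-one rank-one p.s.d. matrices."""
    n = len(A_list[0])
    A_hat = [[sum(w * A[r][c] for w, A in zip(u, A_list)) for c in range(n)] for r in range(n)]
    lam, xi = leading_eigenpair(A_hat)
    return lam, [[x * y for y in xi] for x in xi]

A_list = [[[-1.0, 0.0], [0.0, 9.0]], [[1.0, 0.0], [0.0, -1.0]]]   # hypothetical A_r
u = [0.5, 0.5]
score, Z = weak_learner(A_list, u)
```

The returned `score` is the value of the objective in (20) at the maximizer; comparing it with $v$ indicates whether constraint (17) is violated by the new weak learner.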

3.3 Coordinate Descent Optimization 

We show how an AdaBoost-like optimization procedure can be derived. 

3.3.1 Optimizing for $w_j$

Since we are interested in one-at-a-time coordinate-wise optimization, we keep $w_1, w_2, \dots, w_{j-1}$
fixed when solving for $w_j$. The cost function of the primal problem is (in the following derivation,
we drop those terms irrelevant to the variable $w_j$)

$$C_p(w_j) = \log\Bigl[\sum_{r=1}^{|\mathcal{I}|} \exp(-\rho_r^{j-1}) \cdot \exp(-H_{rj} w_j)\Bigr] + v w_j.$$

Clearly, $C_p$ is convex in $w_j$ and hence any local minimum is also globally optimal. The
first derivative of $C_p$ w.r.t. $w_j$ vanishes at optimality, which results in

$$\sum_{r=1}^{|\mathcal{I}|} (H_{rj} - v)\, u_r^{j-1} \exp(-w_j H_{rj}) = 0. \tag{21}$$

If $H_{rj}$ is discrete, such as $\{+1, -1\}$ in standard AdaBoost, we can obtain a closed-form solution
similar to AdaBoost. Unfortunately, in our case $H_{rj}$ can be any real value. We instead use bisection
to search for the optimal $w_j$. The bisection method is a root-finding algorithm that repeatedly
divides an interval in half and then selects the subinterval in which a root exists. Bisection is
simple and robust, although it is not the fastest algorithm for root-finding. Algorithm 2 gives the
bisection procedure. We have utilized the fact that the l.h.s. of (21) must be positive at $w_l$; otherwise
no solution can be found. When $w_j = 0$, clearly the l.h.s. of (21) is positive.
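The bisection search for the root of (21) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the doubling step that brackets the root, and the default tolerance are assumptions introduced here, and the inputs `H_j` (the values $H_{rj}$ for the current base learner) and `u_prev` (the weights $u_r^{j-1}$) are assumed to be given as plain lists.

```python
import math

def lhs_21(w, H_j, u_prev, v):
    """Left-hand side of (21): sum_r (H_rj - v) * u_r^{j-1} * exp(-w * H_rj)."""
    return sum((h - v) * u * math.exp(-w * h) for h, u in zip(H_j, u_prev))

def bisect_wj(H_j, u_prev, v, w_hi=1.0, tol=1e-8, max_expand=60):
    """Find w_j >= 0 at which the l.h.s. of (21) vanishes."""
    w_lo = 0.0
    if lhs_21(w_lo, H_j, u_prev, v) <= 0:
        raise ValueError("the l.h.s. of (21) must be positive at the lower end")
    # Hypothetical bracketing step (not part of Algorithm 2, which takes the
    # interval as input): double w_hi until the sign of the l.h.s. flips.
    for _ in range(max_expand):
        if lhs_21(w_hi, H_j, u_prev, v) <= 0:
            break
        w_hi *= 2.0
    else:
        raise ValueError("could not bracket a root of (21)")
    # Bisection as in Algorithm 2: keep the half-interval that contains the root.
    while w_hi - w_lo > tol:
        w_mid = 0.5 * (w_lo + w_hi)
        if lhs_21(w_mid, H_j, u_prev, v) > 0:
            w_lo = w_mid
        else:
            w_hi = w_mid
    return 0.5 * (w_lo + w_hi)
```

A root exists only if at least one $H_{rj}$ is smaller than $v$; otherwise the l.h.s. of (21) stays positive for all $w_j$ and the sketch reports that no bracket was found.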

Algorithm 2 Bisection search for $w_j$.
Input: An interval $[w_l, w_u]$ known to contain the optimal value of $w_j$ and convergence
tolerance $\varepsilon > 0$.
1 repeat
2     $w_j = 0.5\,(w_l + w_u)$;
3     if l.h.s. of (21) > 0 then
4         $w_l = w_j$;
5     else
6         $w_u = w_j$;
7 until $w_u - w_l < \varepsilon$;
Output: $w_j$.

3.3.2 Updating u

The rule for updating u can be easily obtained from (19). At iteration j, we have

$$u_r^j \propto \exp(-\rho_r^j) \propto u_r^{j-1} \exp(-H_{rj} w_j), \quad \text{and} \quad \sum_{r=1}^{|\mathcal{I}|} u_r^j = 1,$$

derived from (19). So once $w_j$ is calculated, we can update u as

$$u_r^j = \frac{u_r^{j-1} \exp(-H_{rj} w_j)}{z}, \quad r = 1, \dots, |\mathcal{I}|, \tag{22}$$

where z is a normalization factor so that $\sum_{r=1}^{|\mathcal{I}|} u_r^j = 1$. This is exactly the same as AdaBoost.
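As a small illustration of the update (22), the following Python sketch reweights the triplets after a new coefficient $w_j$ has been found. The function name and the list-based inputs are assumptions made for this example.

```python
import math

def update_u(u_prev, H_j, w_j):
    """Apply (22): u_r^j = u_r^{j-1} * exp(-H_rj * w_j) / z."""
    unnormalized = [u * math.exp(-h * w_j) for u, h in zip(u_prev, H_j)]
    z = sum(unnormalized)  # normalization factor so that the new weights sum to one
    return [x / z for x in unnormalized]
```

Triplets with a negative $H_{rj}$, that is, triplets that the new base learner orders incorrectly, receive a larger weight in the next iteration, which mirrors the reweighting of misclassified examples in AdaBoost.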
3.4 The Base Learning Algorithm 

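In this section, we show that the optimization problem (20) can be exactly and efficiently solved
using eigenvalue-decomposition (EVD).

From $Z \succeq 0$ and $\operatorname{Rank}(Z) = 1$, we know that Z has the format $Z = \mathbf{v}\mathbf{v}^\top$, $\mathbf{v} \in \mathbb{R}^D$;
and $\operatorname{Tr}(Z) = 1$ means $\|\mathbf{v}\|_2 = 1$. We have

$$\Bigl\langle \sum_{r=1}^{|\mathcal{I}|} u_r A_r, Z \Bigr\rangle = \mathbf{v}^\top \Bigl( \sum_{r=1}^{|\mathcal{I}|} u_r A_r \Bigr) \mathbf{v}.$$

By denoting

$$A = \sum_{r=1}^{|\mathcal{I}|} u_r A_r, \tag{23}$$

the base learning optimization equals:

$$\max_{\mathbf{v}} \; \mathbf{v}^\top A \mathbf{v}, \quad \text{s.t. } \|\mathbf{v}\|_2 = 1. \tag{24}$$

It is clear that the largest eigenvalue of A, $\lambda_{\max}(A)$, and its corresponding eigenvector $\mathbf{v}_1$ give the
solution to the above problem. Note that A is symmetric.

$\lambda_{\max}(A)$ is also used as one of the stopping criteria of the algorithm. From the condition (17),
$\lambda_{\max}(A) < v$ means that we are not able to find a new base matrix Z that violates (17), so the
algorithm converges.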
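The base learner of (23) and (24) can be sketched with NumPy as follows. The function name and the argument layout (a list of symmetric matrices `A_list` and a weight vector `u`) are assumptions for this example; `numpy.linalg.eigh` is used here as a stand-in for the MATLAB routine discussed below.

```python
import numpy as np

def base_learner(A_list, u):
    """Return (lambda_max, Z), where Z = v v^T solves (24) for A = sum_r u_r A_r."""
    A = sum(u_r * A_r for u_r, A_r in zip(u, A_list))  # equation (23)
    A = 0.5 * (A + A.T)                    # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues are returned in ascending order
    v = eigvecs[:, -1]                     # unit-norm eigenvector of the largest eigenvalue
    return eigvals[-1], np.outer(v, v)     # Z has rank one and trace one
```

The returned eigenvalue can be compared directly with the regularization parameter to decide whether the algorithm has converged.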

Eigenvalue decomposition is one of the main computational costs in our algorithm. There
are approximate eigenvalue solvers, which guarantee that for a symmetric matrix U and any
$\varepsilon > 0$, a vector $\mathbf{v}$ is found such that $\mathbf{v}^\top U \mathbf{v} \ge \lambda_{\max} - \varepsilon$. Approximately finding the largest
eigenvalue and eigenvector can be very efficient using the Lanczos or power method. We can use the
MATLAB function eigs to calculate the leading eigenvector, which calls mex files of ARPACK. ARPACK is
a collection of Fortran subroutines designed to solve large scale eigenvalue problems. When the
input matrix is symmetric, this software uses a variant of the Lanczos process called the implicitly
restarted Lanczos method.

Algorithm 3 Positive semidefinite matrix learning with stage-wise boosting.
Input:
    • Training set triplets $(a_i, a_j, a_k) \in \mathcal{I}$; compute $A_r$, $r = 1, 2, \dots, |\mathcal{I}|$, using (6).
    • J: maximum number of iterations;
    • (optional) regularization parameter v; we may simply set v to a very small value.
1 Initialize: $u_r^0 = 1/|\mathcal{I}|$, $r = 1, \dots, |\mathcal{I}|$;
2 for $j = 1, 2, \dots, J$ do
3     Find a new base $Z_j$ by finding the largest eigenvalue ($\lambda_{\max}(A)$) and its eigenvector of A in (23);
4     if $\lambda_{\max}(A) < v$ then
5         break (converged);
6     Compute $w_j$ using Algorithm 2;
7     Update u to obtain $u_r^j$, $r = 1, \dots, |\mathcal{I}|$, using (22);
Output: The final p.s.d. matrix $X \in \mathbb{R}^{D \times D}$, $X = \sum_{j=1}^{J} w_j Z_j$.

Another way to reduce the time for computing the leading eigenvector is to compute an approx- 
imate EVD by a fast Monte Carlo algorithm such as the linear time SVD algorithm developed in 
(Drineas et al., 2004). 
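To make the power method mentioned above concrete, here is a minimal NumPy sketch. It is not the ARPACK routine: the shift by the Frobenius norm is an assumption added here so that the iteration converges to the algebraically largest eigenvalue even when the matrix is indefinite, and the function name and defaults are hypothetical.

```python
import numpy as np

def largest_eigpair(A, eps=1e-10, max_iter=10000, seed=0):
    """Approximate the algebraically largest eigenpair of a symmetric matrix A."""
    # The Frobenius norm bounds the spectral radius, so B = A + shift * I is p.s.d.
    # and its dominant eigenvalue is lambda_max(A) + shift.
    shift = np.linalg.norm(A)
    B = A + shift * np.eye(A.shape[0])
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    lam = v @ B @ v
    for _ in range(max_iter):
        w = B @ v
        norm = np.linalg.norm(w)
        if norm == 0.0:          # A is the zero matrix; any unit vector is a solution
            break
        v = w / norm
        lam_new = v @ B @ v      # Rayleigh quotient of the current iterate
        converged = abs(lam_new - lam) < eps
        lam = lam_new
        if converged:
            break
    return lam - shift, v
```

Each iteration costs one matrix-vector product, which is why such solvers can be much cheaper than a full eigen-decomposition when only the leading eigenpair is needed.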

We summarize our main algorithmic results in Algorithm 3. 
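The steps of Algorithm 3 can be put together in a short NumPy sketch for the exponential loss. Several details are assumptions made for this illustration rather than statements about the authors' code: equation (6) is assumed to define $A_r = (a_i - a_k)(a_i - a_k)^\top - (a_i - a_j)(a_i - a_j)^\top$, $H_{rj}$ is assumed to be $\langle A_r, Z_j \rangle$, the bisection interval is fixed to the hypothetical range `[0, w_max]`, and the default value of `v` is a placeholder.

```python
import numpy as np

def triplet_matrices(points, triplets):
    """Assumed form of (6): A_r = (a_i - a_k)(a_i - a_k)^T - (a_i - a_j)(a_i - a_j)^T."""
    mats = []
    for i, j, k in triplets:
        d_ik = points[i] - points[k]
        d_ij = points[i] - points[j]
        mats.append(np.outer(d_ik, d_ik) - np.outer(d_ij, d_ij))
    return np.stack(mats)

def boostmetric_stagewise(A, v=1e-7, max_iter=500, w_max=10.0, tol=1e-8):
    """Stage-wise boosting with the exponential loss, following Algorithm 3."""
    n, dim, _ = A.shape
    u = np.full(n, 1.0 / n)                       # line 1: uniform triplet weights
    X = np.zeros((dim, dim))
    for _ in range(max_iter):
        A_hat = np.tensordot(u, A, axes=1)        # equation (23)
        eigvals, eigvecs = np.linalg.eigh(A_hat)
        if eigvals[-1] < v:                       # lines 4-5: stopping criterion
            break
        vec = eigvecs[:, -1]
        Z = np.outer(vec, vec)                    # line 3: rank-one, trace-one base learner
        H = np.einsum("rab,ab->r", A, Z)          # assumed H_rj = <A_r, Z_j>

        def lhs(w):                               # left-hand side of (21)
            return np.sum((H - v) * u * np.exp(-w * H))

        w_lo, w_hi = 0.0, w_max                   # line 6: bisection (Algorithm 2)
        while w_hi - w_lo > tol:
            w = 0.5 * (w_lo + w_hi)
            if lhs(w) > 0:
                w_lo = w
            else:
                w_hi = w
        w = 0.5 * (w_lo + w_hi)
        X += w * Z
        u = u * np.exp(-H * w)                    # line 7: equation (22)
        u /= u.sum()
    return X
```

Because the output is a nonnegative combination of rank-one p.s.d. matrices, it is p.s.d. by construction and no projection onto the p.s.d. cone is needed. If no root of (21) lies in `[0, w_max]`, the bisection in this sketch simply returns a value close to `w_max`.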

3.5 Learning with the Logistic Loss 

We have considered the exponential loss so far. The proposed framework is general enough
that it can also accommodate other convex loss functions. Here we consider the logistic loss, which
penalizes mis-classifications more moderately than the exponential loss. It is believed that,
on noisy data, the logistic loss may achieve better classification performance.

With the same settings as in the case of the exponential loss, we can write our optimization
problem as

$$\min_{\rho, w} \; \sum_{r=1}^{|\mathcal{I}|} \operatorname{logit}(\rho_r) + v \mathbf{1}^\top w, \quad \text{s.t. } \rho_r = H_{r:} w,\ r = 1, \dots, |\mathcal{I}|,\ w \ge 0. \tag{25}$$

Here logit(·) is the logistic loss defined as $\operatorname{logit}(z) = \log(1 + \exp(-z))$. Similarly, we derive its
Lagrange dual as

$$\min_{u} \; \sum_{r=1}^{|\mathcal{I}|} \operatorname{logit}^*(-u_r), \quad \text{s.t. } \sum_{r=1}^{|\mathcal{I}|} u_r H_{r:} \le v \mathbf{1}^\top, \tag{26}$$

where logit*(·) is the Fenchel conjugate function of logit(·), defined as

$$\operatorname{logit}^*(-u) = u \log(u) + (1 - u) \log(1 - u), \tag{27}$$

when $0 \le u \le 1$, and ∞ otherwise. So the Fenchel conjugate of logit(·) is the negative binary entropy
function. We have reversed the sign of u when deriving the dual.

Again, according to the KKT conditions, we have

$$u_r^* = \frac{\exp(-\rho_r^*)}{1 + \exp(-\rho_r^*)}, \quad \forall r, \tag{28}$$

at optimality. From (28) we can also see that u must be in (0, 1).
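As a worked check of (27) and (28), the conjugate can be computed directly, assuming the usual definition $f^*(y) = \sup_z \{ yz - f(z) \}$ of the Fenchel conjugate. Setting the derivative of the inner expression to zero gives the maximizer, and substituting it back gives the closed form:

```latex
\begin{align*}
\operatorname{logit}^*(-u) &= \sup_{z}\,\bigl\{ -uz - \log(1 + e^{-z}) \bigr\}, \\
0 = -u + \frac{e^{-z}}{1 + e^{-z}}
  \;&\Longrightarrow\; u = \frac{e^{-z}}{1 + e^{-z}}, \qquad z = \log\frac{1-u}{u}, \\
\operatorname{logit}^*(-u) &= -u \log\frac{1-u}{u} + \log(1-u)
  = u \log u + (1-u)\log(1-u).
\end{align*}
```

The second line uses $1 + e^{-z} = 1/(1-u)$ at the maximizer. The stationarity condition in the second line, evaluated at $z = \rho_r^*$, is exactly the relation (28), and it is solvable only for $u$ strictly between 0 and 1.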

Similarly, we want to optimize the primal cost function in a coordinate descent way. First, let
us find the relationship between $u_r^j$ and $u_r^{j-1}$. Here j is the iteration index. From (28), it is
trivial to obtain

$$u_r^j = \frac{1}{(1/u_r^{j-1} - 1) \exp(H_{rj} w_j) + 1}, \quad \forall r. \tag{29}$$

The optimization of $w_j$ can be solved by looking for the root of

$$\sum_{r=1}^{|\mathcal{I}|} H_{rj} u_r^j - v = 0, \tag{30}$$

where $u_r^j$ is a function of $w_j$ as defined in (29).

Therefore, in the case of the logistic loss, to find $w_j$, we modify the bisection search of Algorithm 2:

    • Line 3: if l.h.s. of (30) > 0 then . . .

and Line 7 of Algorithm 3:

    • Line 7: Update u using (29).
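The two modifications for the logistic loss can be sketched as follows. The function names and the fixed search interval `[w_lo, w_hi]` are assumptions made for this example. Note that, unlike (22), the update (29) does not normalize u; each weight stays in (0, 1), and before any base learner is added (28) gives $u_r = 1/2$.

```python
import math

def update_u_logistic(u_prev, H_j, w_j):
    """Apply (29): u_r^j = 1 / ((1 / u_r^{j-1} - 1) * exp(H_rj * w_j) + 1)."""
    return [1.0 / ((1.0 / u - 1.0) * math.exp(h * w_j) + 1.0)
            for u, h in zip(u_prev, H_j)]

def lhs_30(w, H_j, u_prev, v):
    """Left-hand side of (30), with u_r^j expressed through (29)."""
    u_new = update_u_logistic(u_prev, H_j, w)
    return sum(h * u for h, u in zip(H_j, u_new)) - v

def bisect_wj_logistic(H_j, u_prev, v, w_lo=0.0, w_hi=10.0, tol=1e-8):
    """Algorithm 2 with line 3 testing the sign of the l.h.s. of (30)."""
    while w_hi - w_lo > tol:
        w_mid = 0.5 * (w_lo + w_hi)
        if lhs_30(w_mid, H_j, u_prev, v) > 0:
            w_lo = w_mid
        else:
            w_hi = w_mid
    return 0.5 * (w_lo + w_hi)
```

The l.h.s. of (30) is decreasing in $w_j$, since each $u_r^j$ in (29) shrinks when $H_{rj} > 0$ and grows when $H_{rj} < 0$, so the sign test in the bisection is well defined.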



3.6 Totally Corrective Optimization 

In this section, we derive a totally-corrective version of BoostMetric, similar to the case of
TotalBoost (Warmuth et al., 2006; Shen and Li, 2010) for classification, in the sense that the
coefficients of all weak learners are updated at each iteration.

Unlike the stage-wise optimization, here we do not need to keep the previous weights of weak
learners $w_1, w_2, \dots, w_{j-1}$. Instead, the weights of all the selected weak learners $w_1, w_2, \dots, w_j$ are
updated at each iteration j. As discussed, our learning procedure is able to employ various loss
functions such as the hinge loss, exponential loss or logistic loss. To devise a totally-corrective
optimization procedure for solving our problem efficiently, we need to ensure that the objective
function is differentiable with respect to the variables $w_1, w_2, \dots, w_j$. Here, we use the exponential
loss function and the logistic loss function. It is possible to use sub-gradient descent methods when a
non-smooth loss function like the hinge loss is used.

It is clear that solving for w is a typical convex optimization problem since it has a differentiable
and convex objective function: (12) when the exponential loss is used, or (25) when the logistic loss
is used. Hence it can be solved using off-the-shelf gradient-based solvers like L-BFGS-B (Zhu et al.,
1997).

Algorithm 4 Positive semidefinite matrix learning with totally corrective boosting.
Input:
    • Training set triplets $(a_i, a_j, a_k) \in \mathcal{I}$; compute $A_r$, $r = 1, 2, \dots, |\mathcal{I}|$, using (6).
    • J: maximum number of iterations;
    • Regularization parameter v.
1 Initialize: $u_r^0 = 1/|\mathcal{I}|$, $r = 1, \dots, |\mathcal{I}|$;
2 for $j = 1, 2, \dots, J$ do
3     Find a new base $Z_j$ by finding the largest eigenvalue ($\lambda_{\max}(A)$) and its eigenvector of A in (23);
4     if $\lambda_{\max}(A) < v$ then
5         break (converged);
6     Optimize for $w_1, w_2, \dots, w_j$ by solving the primal problem (12) when the exponential
      loss is used or (25) when the logistic loss is used;
7     Update u to obtain $u_r^j$, $r = 1, \dots, |\mathcal{I}|$, using (19) (exponential loss) or (28) (logistic loss);
Output: The final p.s.d. matrix $X \in \mathbb{R}^{D \times D}$, $X = \sum_{j=1}^{J} w_j Z_j$.
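Line 6 of Algorithm 4 for the exponential loss can be sketched with SciPy's L-BFGS-B solver. This is an illustration under stated assumptions: the primal problem (12) is assumed to have the form suggested by the cost function of Section 3.3.1, namely $\log \sum_r \exp(-\rho_r) + v \mathbf{1}^\top w$ with $\rho = Hw$ and $w \ge 0$; (19) is assumed to give $u_r \propto \exp(-\rho_r)$; and the function name and the layout of `H` (one row per triplet, one column per selected base learner) are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def totally_corrective_weights(H, v=1e-7):
    """Minimize log(sum_r exp(-(H w)_r)) + v * sum(w) subject to w >= 0."""
    n_triplets, n_learners = H.shape

    def objective(w):
        rho = H @ w
        shift = (-rho).max()                  # log-sum-exp shift for numerical stability
        e = np.exp(-rho - shift)
        value = shift + np.log(e.sum()) + v * w.sum()
        u = e / e.sum()                       # normalized weights, assumed form of (19)
        grad = -(H.T @ u) + v                 # gradient with respect to w
        return value, grad

    result = minimize(objective, np.zeros(n_learners), jac=True,
                      method="L-BFGS-B", bounds=[(0.0, None)] * n_learners)
    rho = H @ result.x
    e = np.exp(-(rho - rho.min()))
    return result.x, e / e.sum()              # all weights w and the recomputed u
```

The returned u is recomputed from all weights at once, which is the difference from the incremental update (22) used in the stage-wise algorithm.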



Since all the weights $w_1, w_2, \dots, w_j$ are updated, the values $u_r^j$, $r = 1, \dots, |\mathcal{I}|$, are not updated
incrementally but re-calculated at each iteration j. To calculate $u_r^j$, we use (19) (exponential loss)
or (28) (logistic loss) instead of (22) or (29), respectively. Totally-corrective BoostMetric methods
are very simple to implement. Algorithm 4 gives the summary of this algorithm. Next, we show the
convergence property of Algorithm 4. Formally, we want to show the following theorem.

Theorem 3.1. Algorithm 4 makes progress at each iteration. In other words, the objective value
is decreased at each iteration. Therefore, in the limit, Algorithm 4 solves the optimization problem
(12) (or (25)) globally to a desired accuracy.

Proof. Let us consider the exponential loss case of problem (12). The proof follows the same dis- 
cussion for the logistic loss, or any other smooth convex loss function. Assume that the current 
solution is a finite subset of base learners (rank-one trace-one matrices) and their corresponding lin- 
ear coefficients w. If we add a base matrix Z that is not in the current subset, and the corresponding 
w = 0, then the objective value and the solution must remain unchanged. We can conclude that the 
current learned base learners and w are the optimal solution already. 

Consider the case that this optimality condition is violated. We need to show that we can find
a base learner Z, which is not in the current set of all the selected base learners, such that $w > 0$
holds. Now assume that Z is the base learner found by solving (24), and the convergence condition
$\lambda_{\max}(A) < v$ is not satisfied. So, we have $\lambda_{\max}(A) = \langle \sum_{r=1}^{|\mathcal{I}|} u_r A_r, Z \rangle > v$.

If, after this weak learner Z is added into the primal problem, the primal solution remains
unchanged, i.e., the corresponding $w = 0$, then from the optimality condition that L2 in (14) must be
zero, we know that $p = v - \langle \sum_{r=1}^{|\mathcal{I}|} u_r A_r, Z \rangle < 0$. This contradicts the fact that the Lagrange
multiplier $p \ge 0$.

We can conclude that after the base learner Z is added into the primal problem, its corresponding
w must admit a positive value. It means that one more free variable is added into the problem and
re-solving the primal problem would reduce the objective value. Hence a strict decrease in the
objective is guaranteed. So Algorithm 4 makes progress at each iteration.

Furthermore, as the optimization problems involved are all convex, every local optimum is also
a global optimum. Therefore Algorithm 4 is guaranteed to converge to the global solution.

Note that the above proof establishes the convergence of Algorithm 4, but the convergence rate
remains unclear. ■

3.7 Multi-pass BoostMetric 

In this section, we show that BoostMetric can use multi-pass learning to enhance the performance.

Our BoostMetric uses training set triplets $(a_i, a_j, a_k) \in \mathcal{I}$ as input for training. The Mahalanobis
distance metric X can be viewed as a linear transformation in the Euclidean space by projecting
the data using matrix L ($X = LL^\top$). That is, nearest neighbors of samples using Mahalanobis
distance metric X are the same as nearest neighbors using Euclidean distance in the transformed
space. BoostMetric assumes that the triplets of the input training set approximately represent the
actual nearest neighbors of samples in the transformed space defined by the Mahalanobis metric.
However, even though the triplets of BoostMetric consist of nearest neighbors of the original
training samples, the generated triplets are not exactly the same as the actual nearest neighbors of
training samples in the space transformed by L.

We can refine the results of BoostMetric iteratively, as in the multiple-pass LMNN (Weinberger
and Saul, 2009): BoostMetric can estimate the triplets in the transformed space under
a multiple-pass procedure as close to the actual triplets as possible. The rule for multi-pass
BoostMetric is simple. At each pass p (p = 1, 2, ...), we decompose the learned Mahalanobis distance
metric $X_{p-1}$ of the previous pass into a transformation matrix $L_p$. The initial matrix $L_1$ is an identity
matrix. Then we generate the training set triplets from the set of points $\{L^\top a_1, \dots, L^\top a_m\}$, where
$L = L_1 \cdot L_2 \cdots L_p$. The final Mahalanobis distance metric X becomes $LL^\top$ in multi-pass
BoostMetric.
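The decomposition $X = LL^\top$ used in each pass can be sketched with an eigen-decomposition. The text does not say which factorization is used, so the eigen-decomposition below is an assumption, as are the function names; any L with $LL^\top = X$ induces the same distances.

```python
import numpy as np

def metric_to_transform(X):
    """Return L with X = L L^T, using an eigen-decomposition of the p.s.d. matrix X."""
    eigvals, eigvecs = np.linalg.eigh(X)
    eigvals = np.clip(eigvals, 0.0, None)   # remove tiny negative round-off
    return eigvecs * np.sqrt(eigvals)       # scales column i by sqrt(lambda_i)

def transform_points(points, L):
    """Rows of the result are L^T a_i for the rows a_i of points."""
    return points @ L

# Accumulating passes: if X_p is learned on points already transformed by L_total,
# the combined transform is L_total @ metric_to_transform(X_p), matching
# L = L_1 * L_2 * ... * L_p in the text.
```

Squared Euclidean distances between transformed points equal the Mahalanobis distances $(a_i - a_j)^\top X (a_i - a_j)$ between the original points, so the triplets for the next pass can be generated with an ordinary nearest-neighbor search.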

4. Experiments 

In this section, we present experiments on data visualization, classification and image retrieval tasks. 
4.1 An Illustrative Example 

We demonstrate a data visualization problem on an artificial toy dataset (concentric circles) in Fig. 1.
The dataset has four classes. The first two dimensions follow concentric circles while the remaining
eight dimensions are all random Gaussian noise. In this experiment, 9000 triplets are generated for
training. When the scale of the noise is large, PCA fails to find the first two informative dimensions.
LDA fails too because clearly each class does not follow a Gaussian distribution and their centers
overlap at the same point. The proposed BoostMetric algorithm finds the informative features.
The eigenvalues of X learned by BoostMetric are {0.542, 0.414, 0.007, 0, ..., 0}, which indicates
that BoostMetric successfully reveals the data's underlying 2D structure. We have used
the exponential loss in this experiment.

Table 1: Comparison of test classification error rates (%) of a 3-nearest neighbor classifier on benchmark datasets. Results of NCA are not
available either because the algorithm does not converge or due to the out-of-memory problem. BoostMetric-E indicates
BoostMetric with the exponential loss and BoostMetric-L is BoostMetric with the logistic loss; both use stage-wise optimization.
"MP" means Multiple-Pass BoostMetric and "TC" is BoostMetric with totally corrective optimization. We report
computational time as well.

Figure 1: The data are projected into 2D with PCA (left), LDA (middle) and BoostMetric
(right). Both PCA and LDA fail to recover the data structure. The local structure of
the data is preserved after projection by BoostMetric.



4.2 Classification on Benchmark Datasets

We evaluate BoostMetric on 7 datasets of different sizes. Some of the datasets have very high
dimensional inputs. We use PCA to decrease the dimensionality before training on these datasets
(MNIST, USPS and yFaces). PCA pre-processing helps to eliminate noise and speed up computation.
Table 1 summarizes the datasets in detail. We have used USPS and MNIST handwritten digits,
Yale face recognition datasets, and a few UCI machine learning datasets (see footnote 2).

Experimental results are obtained by averaging over 10 runs (except for large datasets MNIST
and Letter). We randomly split the datasets for each run. We have used the same mechanism
to generate training triplets as described in (Weinberger et al., 2005). Briefly, for each training
point $a_i$, the k nearest neighbors that have the same label as $y_i$ (targets), as well as the k nearest
neighbors that have different labels from $y_i$ (impostors), are found. We then construct triplets from
$a_i$ and its corresponding targets and impostors. For all the datasets, we have set k = 3
(3-nearest-neighbor). We have compared our method against a few methods: RCA (Bar-Hillel et al., 2005),
NCA (Goldberger et al., 2004), ITML (Davis et al., 2007) and LMNN (Weinberger et al., 2005). Also in
Table 1, "Euclidean" is the baseline algorithm that uses the standard Euclidean distance. The codes for
these compared algorithms are downloaded from the corresponding authors' websites. The experiment
setting for LMNN follows (Weinberger et al., 2005). The slack variable parameter for ITML is tuned using
cross validation over the values 0.01, 0.1, 1, 10 as in (Davis et al., 2007). For BoostMetric, we
have set v = 10^^, the maximum number of iterations J = 500. 
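The triplet generation described above can be sketched as follows. The text does not spell out how targets and impostors are paired, so pairing every target with every impostor of the same point is an assumption of this example, as are the function name and the brute-force distance computation.

```python
import numpy as np

def generate_triplets(points, labels, k=3):
    """Return index triplets (i, j, l): j is a target of i and l is an impostor of i."""
    labels = np.asarray(labels)
    sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    triplets = []
    for i in range(len(points)):
        order = np.argsort(sq_dists[i])
        order = order[order != i]                           # exclude the point itself
        targets = order[labels[order] == labels[i]][:k]     # k nearest same-label points
        impostors = order[labels[order] != labels[i]][:k]   # k nearest other-label points
        triplets.extend((i, int(j), int(l)) for j in targets for l in impostors)
    return triplets
```

Under this pairing each training point contributes up to $k^2$ triplets, so the number of triplets grows linearly with the number of training points.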

BoostMetric has different variants which use 1) the exponential loss (BoostMetric-E), 2) 
the logistic loss (BoostMetric-L), 3) multiple pass evaluation (MP) for updating triplets with the 
exponential and logistic loss, and 4) two optimization strategies, namely, stage-wise optimization 
and totally corrective optimization. The experiments are conducted by using Matlab and a C-mex 
implementation of the L-BFGS-B algorithm. 

As reported in Table 1, we can conclude: 1) BoostMetric consistently improves the accuracy
of kNN classification using Euclidean distance on most datasets. So learning a Mahalanobis
metric based upon the large margin concept indeed leads to improvements in kNN classification. 2)
BoostMetric outperforms other state-of-the-art algorithms in most cases (on 5 out of 7 datasets).
LMNN is the second best algorithm on these 7 data sets statistically. LMNN's results are consistent
with those given in (Weinberger et al., 2005). ITML is faster than BoostMetric on most large
datasets such as MNIST. However, it has higher error rates than BoostMetric in our experiment.
3) NCA can only be run on a few small data sets. In general NCA does not perform well.
Initialization is important for NCA because NCA's objective function is highly non-convex, so the
solver can only find a local optimum.

2. http://archive.ics.uci.edu/ml/

Table 2: Test error (%) of a 3-nearest neighbor classifier with different values of the parameter v.
Each experiment is run 10 times. We report the mean and variance. As expected, as long
as v is sufficiently small, its value has almost no effect on the final classification
performance over a wide range.

In this experiment, LMNN solves for the global optimum (learning X) except for the Wine
dataset. When the LMNN solver solves for X on the Wine dataset, the error rate is large
(20.77% ± 14.18%). So instead we have solved for the projection matrix L on Wine. Also note that the
numbers of training data on Iris, Wine and Bal in (Weinberger et al., 2005) are different from those in
our experiment. We have used these datasets from UCI. For the experiment on MNIST, if we deskew the
handwritten digits data first as in (Weinberger and Saul, 2009), the final accuracy can be slightly
improved. Here we have not deskewed the data.

4.2.1 Influence of v

Previously, we claimed that the stage-wise version of BoostMetric is parameter-free like AdaBoost.
However, we do have a parameter v. Actually, AdaBoost simply sets v = 0. The coordinate-wise
gradient descent optimization strategy of AdaBoost leads to an $\ell_1$-norm regularized maximum margin
classifier (Rosset et al., 2004). It is shown that AdaBoost minimizes its loss criterion with an $\ell_1$
constraint on the coefficient vector. Given the similarity of the optimization of BoostMetric with
AdaBoost, we conjecture that BoostMetric has the same property. Here we empirically show
that as long as v is sufficiently small, the final performance is not affected by the value of v. We have
set V from 10-^ to 10""* and run BOOSTMETRIC on 3 UCI datasets. Table 2 reports the final 3NN 
classification error with different v. The results are nearly identical.

For the totally corrective version of BoostMetric, similar results are observed. Actually for
LMNN, it was also reported that the regularization parameter does not have a significant impact on
the final results in a wide range (Weinberger and Saul, 2009).

4.2.2 Computational time 

As we discussed, one major issue in learning a Mahalanobis distance is the heavy computational cost
caused by the semidefiniteness constraint.

Figure 2: Computation time of the proposed BoostMetric (stage-wise, exponential loss) and the
LMNN method versus the input data's dimensions on an artificial dataset. BoostMetric
is faster than LMNN with large input dimensions because at each iteration BoostMetric
only needs to calculate the largest eigenvector and LMNN needs a full eigen-decomposition.

We have shown the running time of the proposed algorithm in Table 1 for the classification
tasks (see footnote 3). Our algorithm is generally fast. Our algorithm involves matrix operations and
an EVD for finding its largest eigenvalue and its corresponding eigenvector. The time complexity of
this EVD is $O(D^2)$ with D the input dimensions. We compare our algorithm's running time with LMNN in
Fig. 2 on the artificial dataset (concentric circles). Our algorithm is stage-wise BoostMetric with
the exponential loss. We vary the input dimensions from 50 to 1000 and keep the number of triplets
fixed to 250. LMNN does not use standard interior-point SDP solvers, which do not scale well.
Instead LMNN heuristically combines sub-gradient descent in both the matrices L and X. At each
iteration, X is projected back onto the p.s.d. cone using EVD. So a full EVD with time complexity
$O(D^3)$ is needed. Note that LMNN is much faster than SDP solvers like CSDP (Borchers, 1999).
As seen from Fig. 2, when the input dimensions are low, BoostMetric is comparable to LMNN.
As expected, when the input dimensions become large, BoostMetric is significantly faster than
LMNN. Note that our implementation is in Matlab. Improvements are expected if implemented in
C/C++.

4.3 Visual Object Categorization 

In the following experiments, unless otherwise specified, BoostMetric means the stage-wise
BoostMetric with the exponential loss.

The proposed BoostMetric and the LMNN are further compared on visual object categorization
tasks. The first experiment uses four classes of the Caltech-101 object recognition
database (Fei-Fei et al., 2006), including Motorbikes (798 images), Airplanes (800), Faces (435),
and Background-Google (520). The task is to label each image according to the presence of a
particular object. This experiment involves both object categorization (Motorbikes versus Airplanes)
and object retrieval (Faces versus Background-Google) problems. In the second experiment, we
compare the two methods on the MSRC data set including 240 images (see footnote 4). The objects in
the images can be categorized into nine classes, including building, grass, tree, cow, sky, airplane,
face, car and bicycle. Different from the first experiment, each image in this database often contains
multiple objects. The regions corresponding to each object have been manually pre-segmented, and the
task is to label each region according to the presence of a particular object. Some examples are shown
in Fig. 3.

3. We have run all the experiments on a desktop with an Intel Core™2 Duo CPU, 4G RAM and Matlab 7.7 (64-bit
version).

Figure 3: Examples of the images in the MSRC data set and the pre-segmented regions labeled
using different colors.

4.3.1 Experiment on the Caltech-101 dataset 

For each image of the four classes, a number of interest regions are identified by the Harris-affine
detector (Mikolajczyk and Schmid, 2004) and each region is characterized by the SIFT descriptor
(Lowe, 2004). The total numbers of interest regions extracted from the four classes are about
134,000, 84,000, 57,000, and 293,000, respectively. To accumulate statistics, the images of two
involved object classes are randomly split as 10 pairs of training/test subsets. Restricted to the
images in a training subset (those in a test subset are only used for test), their local descriptors are
clustered to form visual words by using k-means clustering. Each image is then represented by a
histogram containing the number of occurrences of each visual word.
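The histogram representation can be sketched as below, assuming the codebook (the k-means cluster centers) has already been computed. The function name and the brute-force nearest-center search are assumptions for this example; with the number of descriptors reported above, a chunked or tree-based search would be needed in practice.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Count how many local descriptors are assigned to each visual word."""
    # Squared Euclidean distance from every descriptor to every codebook center.
    sq = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = sq.argmin(axis=1)                  # index of the nearest visual word
    return np.bincount(words, minlength=len(codebook))
```

The resulting vector has one entry per visual word, so a codebook of size 100 or 200 gives 100- or 200-dimensional inputs for metric learning.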

Motorbikes versus Airplanes. This experiment discriminates the images of a motorbike from
those of an airplane. In each of the 10 pairs of training/test subsets, there are 959 training images
and 639 test images. Two visual codebooks of size 100 and 200 are used, respectively. With
the resulting histograms, the proposed BoostMetric and the LMNN are learned on a training
subset and evaluated on the corresponding test subset. Their averaged classification error rates are
compared in Fig. 4 (left). For both visual codebooks, the proposed BoostMetric achieves lower
error rates than the LMNN and the Euclidean distance, demonstrating its superior performance. We
also apply a linear SVM classifier with its regularization parameter carefully tuned by 5-fold
cross-validation. Its error rates are 3.87% ± 0.69% and 3.00% ± 0.72% on the two visual codebooks,
respectively. In contrast, a 3NN with BoostMetric has error rates 3.63% ± 0.68% and 2.96% ±
0.59%. Hence, the performance of the proposed BoostMetric is comparable to the state-of-the-art
SVM classifier. Also, Fig. 4 (right) plots the test error of BoostMetric against the number
of triplets for training. The general trend is that more triplets lead to smaller errors.

4. See http://research.microsoft.com/en-us/projects/objectclassrecognition/.

Shen, Kim, Wang and van den Hengel

[Figure 4 panels: test error by codebook dimension (100D, 200D); test error versus number of triplets.]

Figure 4: Test error (3-nearest neighbor) of BoostMetric on the Motorbikes versus Airplanes
datasets. The second plot shows the test error against the number of training triplets with
a 100-word codebook.

Faces versus Background-Google. This experiment uses the two object classes as a retrieval
problem. The target of retrieval is face images. The images in the class of Background-Google are
randomly collected from the Internet and they represent the non-target class. BoostMetric is
first learned from a training subset and retrieval is conducted on the corresponding test subset. In
each of the 10 training/test subsets, there are 573 training images and 382 test images. Again, two
visual codebooks of size 100 and 200 are used. Each face image in a test subset is used as a query,
and its distances from other test images are calculated by the proposed BoostMetric, LMNN and the
Euclidean distance, respectively. For each metric, the precision of the retrieved top 5, 10, 15 and
20 images is computed. The precision values from each query are averaged on this test subset and
then averaged over the 10 test subsets. The retrieval precision of these metrics is shown in Fig. 5
(with a codebook size 100). As we can see, BoostMetric consistently attains the highest
values, which again verifies its advantages over LMNN and the Euclidean
distance. With a codebook size 200, very similar results are obtained.
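The retrieval precision for a single query can be sketched as follows. The function name, the argument layout and the label encoding are assumptions made for this example; passing the identity matrix as `X` gives the Euclidean baseline.

```python
import numpy as np

def retrieval_precision(query, gallery, gallery_labels, X, target_label,
                        ks=(5, 10, 15, 20)):
    """Precision of the top-k retrieved items under the Mahalanobis metric X."""
    diffs = gallery - query
    dists = np.einsum("nd,de,ne->n", diffs, X, diffs)   # (a - q)^T X (a - q)
    order = np.argsort(dists)                           # nearest gallery items first
    hits = np.asarray(gallery_labels)[order] == target_label
    return {k: float(hits[:k].mean()) for k in ks}
```

Averaging these values over all face queries in a test subset, and then over the 10 test subsets, gives the kind of summary described in the text.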

4.3.2 Experiment on the MSRC dataset 

The 240 images of the MSRC database are randomly halved into 10 groups of training and test sets.
Given a set of training images, the task is to predict the class label for each of the pre-segmented
regions in a test image. We follow the work in (Winn et al., 2005) to extract features and conduct
experiments. Specifically, each image is converted from the RGB color space to the CIE Lab color
space. First, three Gaussian low-pass filters are applied to the L, a, and b channels, respectively.
The standard deviations σ of the filters are set to 1, 2, and 4, respectively, and the filter size is
defined as 4σ. This step produces 9 filter responses for each pixel in an image. Second, four Laplacian
of Gaussian (LoG) filters are applied to the L channel only, with σ = 1, 2, 4, 8 and the filter size
of 4σ. This step gives rise to 4 filter responses for each pixel. Lastly, the first derivatives of the
Gaussian filter with σ = 2, 4 are computed from the L channel along the row and column directions,
respectively. This results in 4 more filter responses. After applying this set of filter banks, each
pixel is represented by a 17-dimensional feature vector (9 + 4 + 4 responses). All the feature vectors
from a training set are clustered using k-means clustering with a Mahalanobis distance (see footnote 5).
By setting k to 2000, a visual codebook of 2000 visual words is obtained. We implement the word-merging
approach in (Winn et al., 2005) and obtain a compact and discriminative codebook of 300 visual words.
Each pre-segmented object region is then represented as a 300-dimensional histogram.

Figure 5: Retrieval accuracy of distance metric learning algorithms on the Faces versus
Background-Google dataset. Error bars show the standard deviation.

The proposed BoostMetric is compared with the LMNN algorithm as follows. With 10 nearest
neighbors information, about 20,000 triplets are constructed and used to train BoostMetric.
To ensure convergence, the maximum number of iterations is set as 5000 in the optimization of
training BoostMetric. The training of LMNN follows the default setting. kNN classifiers with
the two learned Mahalanobis distances and the Euclidean distance are applied to each training and
test group to categorize an object region. The categorization error rate on each test group is
summarized in Table 3. As expected, both learned Mahalanobis distances achieve superior categorization
performance to the Euclidean distance. Moreover, the proposed BoostMetric achieves better
performance than the LMNN, as indicated by its lower average categorization error rate and the
smaller standard deviation. Also, the kNN classifier using the proposed BoostMetric achieves
comparable or even higher categorization performance than those reported in (Winn et al., 2005).
Besides the categorization performance, we compare the computational efficiency of BoostMetric
and the LMNN in learning a Mahalanobis distance. The computational time result is based
on the Matlab codes for both methods. In this experiment, the average time cost by BoostMetric
for learning the Mahalanobis distance is 3.98 hours, whereas the LMNN takes about 8.06 hours
to complete this process. Hence, the proposed BoostMetric has a shorter training process than
the LMNN method. This again demonstrates the computational advantage of BoostMetric
over the LMNN method.

5. Note that this Mahalanobis distance is different from the one that we are going to learn with BoostMetric.

Table 3: Comparison of the categorization performance.




Figure 6: Four generated triplets based on the pairwise information provided by the LFW data set. 

For the three images in each triplet, the first two belong to the same individual and the 
third one is a different individual. 



4.4 Unconstrained Face Recognition 

We use the "labeled faces in the wild" (LFW) dataset (Huang et al., 2007) for face recognition in 
this experiment. 

This is a data set of unconstrained face images, which has a large range of variations seen in the real
world, including 13,233 images of 5,749 people collected from news articles on the Internet. The face
recognition task here is pair matching: given two face images, determine whether these two images
are of the same individual. So we classify unseen pairs to determine whether each image in the
pair indicates the same individual or not, by applying MkNN of (Guillaumin et al., 2009) instead of
kNN.

Features of face images are extracted by computing 3-scale, 128-dimensional SIFT descriptors
(Lowe, 2004), which center on 9 points of facial features extracted by a facial feature descriptor,
the same as described in (Guillaumin et al., 2009). PCA is then performed on the SIFT vectors to reduce
the dimension to between 100 and 400.

Table 4: Comparison of the face recognition accuracy (%) of our proposed BoostMetric on the 
LFW dataset by varying the PCA dimensionality and the number of triplets for each fold. 

Simple recognition systems with a single descriptor. Table 4 shows the performance of BoostMetric 
when varying the PCA dimensionality and the number of triplets. Increasing the number of 
training triplets gives a slight improvement in recognition accuracy. The dimension after PCA has 
more impact on the final accuracy for this task. 
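
To make the role of a training triplet concrete, the following minimal sketch checks whether a candidate metric respects the ordering described in the caption of Figure 6: the first image should be closer to the second (same individual) than to the third (different individual). The function names, the `margin` parameter, and the toy 2-D points are hypothetical illustrations, not the features or code used in the experiment. 

```python
def sq_mahalanobis(M, x, y):
    """Squared Mahalanobis distance (x - y)^T M (x - y) for a square matrix M."""
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)
    return sum(d[r] * M[r][c] * d[c] for r in range(n) for c in range(n))


def triplet_satisfied(M, a_i, a_j, a_k, margin=0.0):
    """True if a_i is closer to a_j (same person) than to a_k (different person).

    The margin argument is an assumption added for illustration.
    """
    return sq_mahalanobis(M, a_i, a_k) - sq_mahalanobis(M, a_i, a_j) > margin


def satisfied_fraction(M, triplets):
    """Fraction of (a_i, a_j, a_k) triplets whose ordering M respects."""
    return sum(triplet_satisfied(M, *t) for t in triplets) / len(triplets)


# Toy 2-D data (hypothetical, not LFW features).
identity = [[1.0, 0.0], [0.0, 1.0]]    # plain Euclidean metric
stretched = [[1.0, 0.0], [0.0, 0.0]]   # PSD metric that ignores the second axis
triplets = [
    ([0.0, 0.0], [0.5, 3.0], [1.0, 0.0]),
    ([0.0, 0.0], [0.2, 0.1], [2.0, 2.0]),
]

print(satisfied_fraction(identity, triplets))
print(satisfied_fraction(stretched, triplets))
```

In this toy case the Euclidean metric violates the first triplet because the same-identity pair differs mostly along the second axis, while a metric that down-weights that axis satisfies both. 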

In Fig. 7, we plot the ROC curves of our method together with those of other algorithms for face 
recognition. To obtain our ROC curve, we move the MkNN threshold value across the distributions of 
match and mismatch similarity scores. Fig. 7 (a) shows methods that use a single descriptor and a 
single classifier only. As can be seen, our system using BoostMetric outperforms all the others in 
the literature, at a very small computational cost. 
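
The threshold sweep can be sketched as follows. This is an illustrative sketch only: it assumes that a higher similarity score means "more likely the same individual", that a pair is accepted when its score is at least the threshold, and that per-fold rates are averaged at each fixed threshold as the caption of Figure 7 describes. The function names and the toy scores are hypothetical. 

```python
def rates_at(threshold, match_scores, mismatch_scores):
    """Return (false positive rate, true positive rate) at one threshold."""
    tpr = sum(s >= threshold for s in match_scores) / len(match_scores)
    fpr = sum(s >= threshold for s in mismatch_scores) / len(mismatch_scores)
    return fpr, tpr


def averaged_roc(folds, thresholds):
    """Average the per-fold rates at each fixed threshold.

    folds is a list of (match_scores, mismatch_scores) pairs, one per fold.
    """
    points = []
    for t in thresholds:
        rates = [rates_at(t, match, mismatch) for match, mismatch in folds]
        fpr = sum(r[0] for r in rates) / len(rates)
        tpr = sum(r[1] for r in rates) / len(rates)
        points.append((fpr, tpr))
    return points


# Two toy folds of similarity scores (hypothetical values).
folds = [
    ([0.9, 0.8, 0.4], [0.6, 0.3, 0.1]),
    ([0.7, 0.6, 0.5], [0.5, 0.2, 0.1]),
]
curve = averaged_roc(folds, thresholds=[1.0, 0.55, 0.0])
for fpr, tpr in curve:
    print(f"FPR={fpr:.3f}  TPR={tpr:.3f}")
```

Sweeping the threshold from above the largest score down to the smallest traces the curve from (0, 0) to (1, 1). 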

Complex recognition systems with one or more descriptors. Fig. 7 (b) plots the performance 
of more complicated recognition systems that use hybrid descriptors or combinations of classifiers. 
See Table 5 for details. We can see that the performance of BoostMetric is close to the 
state of the art. 

In particular, BoostMetric outperforms the method of Guillaumin et al. (2009), which has 
a similar pipeline but uses LMNN for learning a metric. This comparison also confirms the 
importance of learning an appropriate metric for vision problems. 

5. Conclusion 

We have presented a new algorithm, BoostMetric, to learn a positive semidefinite metric using 
boosting techniques. We have generalized AdaBoost in the sense that the weak learner of 
BoostMetric is a matrix, rather than a classifier. Our algorithm is simple and efficient. 
Experiments show that it performs better than a few existing state-of-the-art metric learning 
methods. We are currently incorporating on-line learning into BoostMetric so that it can handle 
even larger data sets. 
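
As a reminder of why a matrix weak learner keeps the metric valid, the sketch below combines trace-one rank-one matrices with nonnegative weights and checks that the resulting quadratic form is never negative. It is only an illustration of the decomposition stated in the abstract, not the BoostMetric training procedure: the weights and directions are arbitrary hypothetical values rather than quantities chosen by boosting, and all function names are made up for this example. 

```python
import math


def unit(v):
    """Scale v to unit length so that the outer product u u^T has trace one."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def combine(weights, directions):
    """Return M = sum_j w_j * u_j u_j^T with w_j >= 0 and ||u_j|| = 1."""
    dim = len(directions[0])
    M = [[0.0] * dim for _ in range(dim)]
    for w, v in zip(weights, directions):
        if w < 0:
            raise ValueError("weights must be nonnegative to keep M PSD")
        u = unit(v)
        for r in range(dim):
            for c in range(dim):
                M[r][c] += w * u[r] * u[c]
    return M


def quad_form(M, x):
    """Evaluate x^T M x."""
    n = len(x)
    return sum(x[r] * M[r][c] * x[c] for r in range(n) for c in range(n))


# Arbitrary illustrative weights and directions (not learned).
weights = [0.5, 2.0]
directions = [[1.0, 1.0], [1.0, -2.0]]
M = combine(weights, directions)

trace = M[0][0] + M[1][1]  # equals the sum of the weights
print(trace)
print([quad_form(M, x) for x in ([1.0, 0.0], [0.0, 1.0], [-3.0, 2.0])])
```

Because each term contributes w_j (u_j^T x)^2 to the quadratic form, nonnegative weights are enough to keep M positive semidefinite without an explicit semidefinite constraint. 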

We also want to learn a metric using BoostMetric in the semi-supervised and multi-task 
learning settings. It has been shown by Weinberger and Saul (2009) that classification 
performance can be improved by learning multiple local metrics. We will extend BoostMetric to learn 
multiple metrics. Finally, we will explore generalizing BoostMetric to solve more general 
semidefinite matrix learning problems in machine learning. 

Figure 7: (top) ROC curves of methods that use a single descriptor and a single classifier; 
(bottom) ROC curves of methods that use hybrid descriptors. Our BoostMetric with a single 
classifier is also plotted. Each point on the curves is the average over the 10 folds of rates 
for a fixed threshold. 